Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It’s crucial for decision-making in various fields, including forestry.
I. Introduction to Statistics
- Data: A collection of facts, figures, or information, usually numerical, collected for a specific purpose.
- Raw Data: Data collected in its original form, before any organization or analysis.
- Types of Data:
- Quantitative Data: Numerical data that can be measured or counted.
- Discrete Data: Can only take specific, separate values (e.g., number of trees, number of animals). Often integers.
- Continuous Data: Can take any value within a given range (e.g., height of a tree, temperature, weight).
- Qualitative (Categorical) Data: Data that describes characteristics or categories, not numerical values.
- Nominal Data: Categories with no inherent order (e.g., tree species, forest type).
- Ordinal Data: Categories with a meaningful order, but the intervals between categories are not uniform or precisely measurable (e.g., timber quality: excellent, good, fair, poor).
- Population: The entire group of individuals, objects, or data that a researcher is interested in studying.
- Sample: A subset or a representative part of the population selected for study. It’s often impractical to study the entire population.
- Variables: Characteristics or attributes that can take on different values.
- Independent Variable: The variable whose variation does not depend on that of another. It’s manipulated or chosen by the researcher.
- Dependent Variable: The variable whose value depends on that of another. It’s measured or observed.
- Descriptive Statistics: Summarizes and describes the characteristics of a dataset (e.g., mean, median, mode, standard deviation).
- Inferential Statistics: Uses sample data to make inferences or predictions about a larger population (e.g., hypothesis testing, confidence intervals).
II. Measures of Central Tendency
These measures locate the center of a dataset.
1. Mean (Arithmetic Mean)
- Definition: The average of all values in a dataset. Calculated by summing all values and dividing by the number of values.
- Formula:
- For ungrouped data: $\bar{X} = (\sum X) / n$
- $\bar{X}$ (X-bar) = Mean
- $\sum X$ (Sigma X) = Sum of all values
- $n$ = Number of values
- For grouped data (frequency distribution): $\bar{X} = (\sum fX) / (\sum f)$
- $f$ = Frequency of each class/value
- $X$ = Midpoint of each class (for continuous data) or value (for discrete data)
- Properties:
- Sensitive to extreme values (outliers).
- Used for quantitative data.
- Most commonly used measure.
- Highlight: “Mean” is the “Average.”
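The grouped-data formula above can be checked with a short script. A minimal sketch, assuming a hypothetical frequency table of tree-diameter classes (the midpoints and frequencies are made up for illustration):

```python
# Grouped-data mean: X-bar = (sum of f*X) / (sum of f)
midpoints   = [12.5, 17.5, 22.5, 27.5]  # X: class midpoints (e.g., diameter in cm)
frequencies = [4, 10, 7, 3]             # f: observations per class

total_f  = sum(frequencies)                                    # sum of f = 24
total_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f*X = 465
mean = total_fx / total_f
print(mean)  # 19.375
```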
2. Median
- Definition: The middle value of a dataset when arranged in ascending or descending order.
- Calculation:
- Odd number of values: The median is the $(n+1)/2$-th value.
- Even number of values: The median is the average of the $(n/2)$-th and $((n/2)+1)$-th values.
- Properties:
- Not affected by extreme values (resistant to outliers).
- Can be used for quantitative and ordinal data.
- Highlight: “Median” is the “Middle.” (Remember M for Middle)
3. Mode
- Definition: The value that appears most frequently in a dataset.
- Properties:
- Can be used for any type of data (quantitative or qualitative).
- A dataset can have no mode, one mode (unimodal), two modes (bimodal), or more (multimodal).
- Not affected by extreme values.
- Highlight: “Mode” is “Most Frequent.”
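All three measures of central tendency can be compared on one small dataset using Python's standard statistics module. The tree heights below are hypothetical; the 40 m value is included deliberately as an outlier to show the mean's sensitivity:

```python
import statistics

heights = [10, 12, 12, 13, 14, 40]  # hypothetical tree heights (m); 40 is an outlier

print(statistics.mean(heights))    # 16.833... (pulled upward by the outlier)
print(statistics.median(heights))  # 12.5 (average of the two middle values, 12 and 13)
print(statistics.mode(heights))    # 12 (the most frequent value)
```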
Table: Comparison of Measures of Central Tendency
| Feature | Mean | Median | Mode |
|---|---|---|---|
| Calculation | Sum values / Count | Middle value (ordered data) | Most frequent value |
| Effect of Outliers | Highly affected | Minimally affected | Not affected |
| Data Types | Quantitative (interval, ratio) | Quantitative, Ordinal | All (nominal, ordinal, quantitative) |
| Uniqueness | Always unique | Always unique | May not exist, or multiple modes |
| Best Used When | Data is symmetrical, no extreme outliers | Data is skewed, outliers present | Finding most common item, nominal data |
III. Measures of Dispersion (Variability)
These measures describe how spread out or varied the data points are.
1. Range
- Definition: The difference between the highest and lowest values in a dataset.
- Formula: Range = Maximum Value – Minimum Value
- Properties: Simple to calculate but highly affected by outliers. Only considers two values.
2. Variance
- Definition: The average of the squared differences from the mean. It measures the spread of data points around the mean.
- Formula (Population Variance): $\sigma^2 = (\sum (X - \mu)^2) / N$
- $\sigma^2$ (sigma-squared) = Population Variance
- $\mu$ (mu) = Population Mean
- $N$ = Population size
- Formula (Sample Variance): $s^2 = (\sum (X - \bar{X})^2) / (n - 1)$
- $s^2$ = Sample Variance
- $\bar{X}$ = Sample Mean
- $(n - 1)$ is used for unbiased estimation of population variance from a sample. This is called Bessel’s Correction.
- Properties: Units are squared, making interpretation difficult.
3. Standard Deviation
- Definition: The square root of the variance. It measures the typical distance of data points from the mean.
- Formula (Population Standard Deviation): $\sigma = \sqrt{(\sum (X - \mu)^2) / N}$
- Formula (Sample Standard Deviation): $s = \sqrt{(\sum (X - \bar{X})^2) / (n - 1)}$
- Properties:
- Expressed in the same units as the original data, making it easier to interpret than variance.
- A larger standard deviation indicates greater variability.
- Most widely used measure of dispersion.
- Highlight: “Standard Deviation” gives the “Average Distance from the Mean.”
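The population and sample formulas differ only in the divisor ($N$ versus $n - 1$). A minimal sketch with made-up measurements, checked against the standard library's own functions:

```python
import math
import statistics

data = [4.0, 8.0, 6.0, 5.0, 7.0]  # hypothetical measurements
n = len(data)
mean = sum(data) / n              # 6.0

ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 10.0
pop_var  = ss / n        # sigma^2 = 2.0 (divide by N)
samp_var = ss / (n - 1)  # s^2 = 2.5 (divide by n - 1, Bessel's Correction)
pop_sd  = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The standard library agrees:
assert pop_var  == statistics.pvariance(data)
assert samp_var == statistics.variance(data)
```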
4. Quartiles and Interquartile Range (IQR)
- Quartiles: Divide an ordered dataset into four equal parts.
- Q1 (First Quartile): The value below which 25% of the data falls. (Also known as 25th percentile).
- Q2 (Second Quartile): The median; the value below which 50% of the data falls. (Also known as 50th percentile).
- Q3 (Third Quartile): The value below which 75% of the data falls. (Also known as 75th percentile).
- Interquartile Range (IQR): The range of the middle 50% of the data.
- Formula: IQR = Q3 – Q1
- Properties:
- Not affected by extreme values (resistant to outliers).
- Useful for skewed distributions.
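Quartile conventions vary between textbooks and software, so exam answers can differ slightly depending on the method. The sketch below uses the common median-of-halves method on hypothetical data:

```python
def median(xs):
    xs = sorted(xs)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def iqr(xs):
    xs = sorted(xs)
    n, half = len(xs), len(xs) // 2
    q1 = median(xs[:half])          # lower half (median excluded when n is odd)
    q3 = median(xs[half + n % 2:])  # upper half
    return q3 - q1

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]
# sorted: 3 5 7 8 | 12 | 13 14 18 21  ->  Q1 = 6, Q2 = 12, Q3 = 16
print(iqr(data))  # 10.0
```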
IV. Probability
Probability is the measure of the likelihood that an event will occur. It’s a numerical value between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
1. Basic Concepts
- Experiment: An action or process that leads to one or several possible outcomes (e.g., tossing a coin, rolling a die, selecting a tree for measurement).
- Outcome: A single possible result of an experiment (e.g., getting a Head, rolling a 3).
- Sample Space (S): The set of all possible outcomes of an experiment (e.g., S = {Head, Tail} for coin toss; S = {1, 2, 3, 4, 5, 6} for die roll).
- Event (E): A subset of the sample space; a particular outcome or a collection of outcomes (e.g., Event A = getting an even number {2, 4, 6}).
2. Types of Probability
- Classical Probability (A Priori): Based on logical reasoning or equally likely outcomes.
- Formula: P(E) = (Number of favorable outcomes) / (Total number of possible outcomes)
- Example: Probability of rolling a 4 on a fair die is 1/6.
- Empirical Probability (A Posteriori / Relative Frequency): Based on observed data or experiments.
- Formula: P(E) = (Number of times an event occurred) / (Total number of trials)
- Example: If a forester observes 20 diseased trees in a sample of 100, the empirical probability of a diseased tree is 20/100 = 0.2.
- Subjective Probability: Based on personal judgment, experience, or intuition. Often used when classical or empirical methods are not applicable (e.g., forester’s estimate of the probability of a wildfire).
3. Key Terms and Rules
- Complement of an Event (E’): All outcomes in the sample space that are *not* in Event E.
- P(E’) = 1 – P(E)
- Example: If P(raining) = 0.3, then P(not raining) = 1 – 0.3 = 0.7.
- Mutually Exclusive Events: Events that cannot occur at the same time (i.e., they have no common outcomes).
- Example: Getting a Head and a Tail on a single coin toss.
- Addition Rule for Mutually Exclusive Events: P(A or B) = P(A) + P(B)
- Non-Mutually Exclusive Events: Events that can occur at the same time (i.e., they have common outcomes).
- Example: Drawing a King or a Red card from a deck. (King of Hearts and King of Diamonds are both King and Red).
- General Addition Rule: P(A or B) = P(A) + P(B) – P(A and B)
- Independent Events: The occurrence of one event does not affect the probability of the other event occurring.
- Example: Tossing two coins: result of first coin doesn’t affect the second.
- Multiplication Rule for Independent Events: P(A and B) = P(A) × P(B)
- Dependent Events: The occurrence of one event *does* affect the probability of the other event occurring.
- Example: Drawing two cards *without replacement* from a deck.
- Conditional Probability: The probability of an event B occurring given that event A has already occurred.
- Formula: P(B|A) = P(A and B) / P(A) (where P(A) > 0)
- Multiplication Rule for Dependent Events: P(A and B) = P(A) × P(B|A)
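The addition and multiplication rules can be verified exactly with fractions, using the standard 52-card-deck examples above:

```python
from fractions import Fraction

# General addition rule: P(King or Red) = P(King) + P(Red) - P(King and Red)
p_king         = Fraction(4, 52)
p_red          = Fraction(26, 52)
p_king_and_red = Fraction(2, 52)   # King of Hearts, King of Diamonds
p_king_or_red  = p_king + p_red - p_king_and_red
print(p_king_or_red)               # 7/13

# Dependent events: P(two Kings without replacement) = P(A) * P(B|A)
p_two_kings = Fraction(4, 52) * Fraction(3, 51)
print(p_two_kings)                 # 1/221
```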
4. Permutations and Combinations
These are methods for counting the number of ways events can occur.
- Permutations: Order matters. Used when arrangement is important (e.g., arranging letters, selecting members for specific roles).
- Formula (n objects taken r at a time): $P(n, r) = n! / (n-r)!$
- $n!$ (n factorial) = $n \times (n-1) \times \cdots \times 2 \times 1$; by convention, $0! = 1$.
- Example: Number of ways to arrange 3 letters (A, B, C) in a row: $P(3, 3) = 3! = 6$ (ABC, ACB, BAC, BCA, CAB, CBA).
- Combinations: Order does *not* matter. Used when selecting a group of items where the arrangement within the group is not important.
- Formula (n objects taken r at a time): $C(n, r) = n! / (r! (n-r)!)$
- Example: Number of ways to choose 2 letters from (A, B, C): $C(3, 2) = 3! / (2! \, 1!) = 3$ (AB, AC, BC).
Highlight for Permutations vs. Combinations:
- Permutations: Think of Position (order matters).
- Combinations: Think of Choosing (order doesn’t matter).
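Python's math module implements both counts directly (math.perm and math.comb, available since Python 3.8), so the factorial formulas can be checked:

```python
from math import comb, factorial, perm

print(perm(3, 3))  # 6: arrangements of A, B, C
print(comb(3, 2))  # 3: ways to choose 2 letters from {A, B, C}

# Both agree with the factorial formulas:
assert perm(5, 2) == factorial(5) // factorial(5 - 2)                   # P(5,2) = 20
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(5 - 2))  # C(5,2) = 10
```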
Mnemonic to remember main probability rules:
- OR makes you ADD (Addition Rule)
- AND makes you MULTIPLY (Multiplication Rule)
V. Data Presentation
Effective presentation makes data understandable and interpretable.
1. Tabular Presentation
- Frequency Distribution Table: Organizes raw data into classes or categories and shows the number (frequency) of observations falling into each class.
- Includes: Class intervals, tallies (optional), frequency, relative frequency, cumulative frequency.
- Relative Frequency: Proportion of observations in a class (frequency / total observations).
- Cumulative Frequency: Sum of frequencies up to a particular class.
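A frequency table with relative and cumulative frequencies can be built in a few lines. A sketch using hypothetical species tallies from a plot survey (names and counts are made up for illustration):

```python
from collections import Counter

species = ["pine", "oak", "pine", "fir", "oak", "pine", "oak", "fir", "pine"]

freq = Counter(species)     # frequency of each category
total = sum(freq.values())  # total observations = 9
cum = 0
for name, f in freq.most_common():
    rel = f / total         # relative frequency (proportion of observations)
    cum += f                # cumulative frequency (running sum)
    print(f"{name:4s}  f={f}  rel={rel:.2f}  cum={cum}")
```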
2. Graphical Presentation
- Bar Chart: Used for qualitative (categorical) or discrete quantitative data. Bars are separate to emphasize distinct categories. Length of bar represents frequency/count.
- Histogram: Used for continuous quantitative data. Bars are adjacent to indicate continuity. The width of the bar represents the class interval, and the area of the bar represents the frequency.
- Pie Chart: Used to show parts of a whole for qualitative data. Each slice represents a proportion of the total. Not recommended for many categories.
- Line Graph: Shows trends over time or continuous data. Points are plotted and connected by lines.
- Scatter Plot: Shows the relationship between two quantitative variables. Each point represents a pair of values. Used to suggest correlation.
VI. Correlation and Regression (Brief Overview)
- Correlation: Measures the strength and direction of the linear relationship between two quantitative variables.
- Correlation Coefficient (r): Ranges from -1 to +1.
- +1: Perfect positive linear relationship.
- -1: Perfect negative linear relationship.
- 0: No linear relationship.
- Caution: Correlation does not imply causation.
- Regression: Describes the nature of the relationship (equation) between two or more variables, allowing for prediction.
- Simple Linear Regression: Fits a straight line to the data (Y = a + bX).
- Y = Dependent variable
- X = Independent variable
- a = Y-intercept
- b = Slope
- Used to predict the value of one variable based on the value of another.
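The least-squares slope, intercept, and correlation coefficient all follow from sums of deviations. A sketch with hypothetical (and deliberately perfectly linear) diameter-height pairs, so the fit and correlation come out exact:

```python
import math

x = [10, 15, 20, 25, 30]  # X: diameter at breast height (cm), hypothetical
y = [8, 11, 14, 17, 20]   # Y: tree height (m); exactly Y = 2 + 0.6X

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx                   # slope = 0.6
a = my - b * mx                 # Y-intercept = 2.0
r = sxy / math.sqrt(sxx * syy)  # correlation = 1.0 (perfect positive)
print(a, b, r)
```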
VII. Key Statistical Symbols
- $\mu$ (mu): Population Mean
- $\bar{X}$ (X-bar): Sample Mean
- $\sigma$ (sigma): Population Standard Deviation
- $s$: Sample Standard Deviation
- $\sigma^2$ (sigma-squared): Population Variance
- $s^2$: Sample Variance
- $N$: Population Size
- $n$: Sample Size
- $\sum$ (Sigma): Summation
- $P(E)$: Probability of Event E
- $P(E')$: Probability of Complement of Event E
- $P(A \text{ or } B)$: Probability of A or B (Union)
- $P(A \text{ and } B)$: Probability of A and B (Intersection)
- $P(B|A)$: Probability of B given A (Conditional Probability)
VIII. Practical Application in Forestry (Relevance for Forester Exam)
- Sampling: Selecting plots for tree measurements, animal counts, soil analysis.
- Descriptive Statistics: Summarizing tree heights, diameters, timber volume, species distribution in a forest stand.
- Inferential Statistics: Estimating total forest volume from sample plots, comparing growth rates under different silvicultural treatments, predicting forest fire risk.
- Probability: Assessing the likelihood of pest outbreaks, disease spread, success of reforestation efforts, or predicting timber yield under varying conditions.
- Data Presentation: Creating graphs and tables to communicate forest inventory results, ecological trends, or management plans.
Final Revision Tip: Understand the purpose behind each statistical measure and technique. Don’t just memorize formulas; know when and why to use them. Pay close attention to distinguishing between sample and population notations, especially for variance and standard deviation.