(09.08.03/03.22.06)
CH3 Summarizing Data
(Nominal, Ordinal, Numerical)
Intro to Statistics by Timothy Bilash MD
September 2003
www.DrTimDelivers.com
based on:
Review of Basic & Clinical Biostatistics by
Beth Dawson, Robert Trapp (2001) CH3
- SCALES OF MEASUREMENT
- Nominal scale (placed in (named) categories, grouped, qualitative)
- organized by an attribute
- is a member of a group or not (discrete membership)
- count the number of observations with or without an attribute that fit the category
- percentages or proportions used
- often represented by frequencies by category rather than the raw numbers by category
- often displayed in contingency tables, which divide the observations into counts that meet each selected criterion
- Ordinal scale (nominal scale plus groups are also put in some order, semi-quantitative)
- organized by an attribute that is more or greater than
- difference between categories is not the same throughout the scale
- percentages or proportions used
- sometimes summarized by median values
- example is rank-order from highest to lowest
- Numerical scale (characterized by a number, quantitative)
- interval or continuous
- characterized by a number that varies in some continuous way
- discrete
- characterized by integer number values
- may be used to describe a continuous scale
- MEASURES OF NOMINAL (GROUPED) DATA
- Proportion
- part divided by the whole (i.e. number divided by the total number)
- Percentage is a proportion multiplied by 100%
- useful for ordinal, numerical and nominal data
- dimensionless
- Ratio
- part divided by another part
- (number with a given characteristic) / (number without it) is an example
- dimensionless, no units
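The difference between a proportion (part over whole) and a ratio (part over another part) can be sketched with a small calculation. The counts below are hypothetical and chosen only for illustration.

```python
# Hypothetical counts: 30 of 120 patients have a given characteristic.
with_trait = 30
without_trait = 90
total = with_trait + without_trait

proportion = with_trait / total        # part divided by the whole
percentage = proportion * 100          # proportion expressed per 100
ratio = with_trait / without_trait     # part divided by another part

print(f"proportion = {proportion:.2f}")
print(f"percentage = {percentage:.0f}%")
print(f"ratio      = {ratio:.2f}")
```

The same counts give a proportion of 0.25 but a ratio of about 0.33, because the denominators differ.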
- Rate
- a ratio times a multiplier (read as rate per units of the multiplier factor)
- for a given unit time interval (ratio per unit time)
- Crude rate
- rate for the whole population (everyone)
- affected by the distribution of characteristics in the population that affect the rate
- Sex-specific rate
- restricted to a given sex
- Cause-specific rate
- restricted to a given cause
- Age adjusted rate
- adjusting the rate weighted by the age of the people at risk
- allows comparing changes in a given population over time (since ages and numbers in each age group changes over time)
- Mortality Rate
- (number who died) / (number who are at risk of dying) in the given time interval
- number at risk halfway through the period is often used for the denominator as an estimate
- Infant Mortality Rate
- number of infants who die before age 1 per 1000 live births
- Morbidity Rate
- (number who develop a disease) / (number at risk of the disease) in the given time interval
- Incidence Rate (per time interval)
- (number of new cases in the given time interval) / (number at risk at the beginning of that interval)
- Standardized Rates
- rates have to be adjusted for the distribution in order to be compared with other populations
- assumes weighted averages maintain the validity of the comparison
- often adjusted by each category to one of the populations as the reference population (direct method of rate standardization)
- standardized mortality ratio (indirect method of rate standardization)
- (number of observed deaths) / (number of expected deaths)
- expected deaths calculated in each population using the specific rates from a standard population as a reference
- no dimensions
- Prevalence (is a proportion and not a rate since no time interval, a snapshot)
- (number with a disease at a moment in time) / (number at risk at that same moment in time)
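The crude rate, the direct method, and the standardized mortality ratio (indirect method) described above can be compared in a short sketch. The two age groups, all counts, and the function names are hypothetical assumptions made for illustration, not data from the text.

```python
# Hypothetical data: (deaths, number at risk) per age group.
study = {"young": (10, 5000), "old": (90, 3000)}
standard = {"young": (30, 10000), "old": (40, 2000)}  # reference population

def crude_rate(pop, per=1000):
    """Rate for the whole population, ignoring its age structure."""
    deaths = sum(d for d, _ in pop.values())
    at_risk = sum(n for _, n in pop.values())
    return deaths / at_risk * per

def direct_adjusted_rate(pop, reference, per=1000):
    """Direct method: apply pop's age-specific rates to the reference age structure."""
    ref_total = sum(n for _, n in reference.values())
    expected = sum(pop[g][0] / pop[g][1] * reference[g][1] for g in pop)
    return expected / ref_total * per

def smr(pop, reference):
    """Indirect method: observed deaths / deaths expected at the reference's specific rates."""
    observed = sum(d for d, _ in pop.values())
    expected = sum(reference[g][0] / reference[g][1] * pop[g][1] for g in pop)
    return observed / expected

print(f"crude rate per 1000:        {crude_rate(study):.2f}")
print(f"age-adjusted rate per 1000: {direct_adjusted_rate(study, standard):.2f}")
print(f"SMR:                        {smr(study, standard):.2f}")
```

In this made-up example the crude rate is much higher than the adjusted rate because the study population has proportionally more old people than the reference population.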
- MEASURES OF NUMERICAL DATA
- Distributions (the spread of Numerical Data)
- Scale (how closely spaced)
- Shape (how the frequency changes along the scale)
- symmetric (evenly distributed about some middle)
- skewed (not evenly distributed about some middle)
- Measures of the Middle (Summary Statistics for Numerical Data)
- Arithmetic Mean or Mean (average of the data values)
- weighted average (used to calculate the mean if frequencies of a measure are reported and not the raw data)
- used for numerical, symmetric data
- Median (middle or midpoint observation, for which half of the observations are smaller and half are larger)
- less sensitive to extreme values or shape of distribution of data than the mean
- used for ordinal, or numerical data if the distribution is skewed
- Mode (most frequent value)
- often a range if data grouped in intervals
- used primarily when distribution is bimodal
- Geometric Mean (nth root of the product of the n data values)
- used when data is measured on a logarithmic scale
- logarithm of the geometric mean is equal to the mean of the logarithms of the data values
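The measures of the middle listed above can be computed side by side on a small skewed data set. This is a minimal sketch using Python's standard library; the data values and the frequency table are hypothetical.

```python
import math
import statistics

values = [1, 2, 2, 4, 8, 16]  # hypothetical, right-skewed observations

mean = statistics.mean(values)      # arithmetic mean
median = statistics.median(values)  # midpoint observation
mode = statistics.mode(values)      # most frequent value

# Weighted average: the same mean recovered from a frequency table
# (value -> number of times it was observed) instead of the raw data.
frequencies = {1: 1, 2: 2, 4: 1, 8: 1, 16: 1}
weighted_mean = sum(v * f for v, f in frequencies.items()) / sum(frequencies.values())

# Geometric mean: exponentiate the mean of the logarithms,
# which equals the nth root of the product of the n values.
geometric_mean = math.exp(statistics.mean([math.log(v) for v in values]))

print(mean, median, mode, weighted_mean, round(geometric_mean, 3))
```

Because the data are skewed to the right, the mean (5.5) sits well above the median (3.0), while the geometric mean (about 3.56) stays closer to the median.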
- Measures of Spread or Dispersion (Summary Statistics for Numerical Data)
- Range (spread of lowest to highest data value)
- emphasizes the extreme values
- Percentiles
- divides up the distribution into percentile intervals
- allows comparison of data to a norm
- Normal or Standard Distribution
- if the distribution is a Bell Shape
- Standard Deviation (SD, how data clusters about the mean)
- square root of the average squared deviation from the mean
- (the similar average of the absolute values of deviations from the mean is not generally used)
- used with a mean
- Variance is the SD squared
- Coefficient of Variation
- Standard Deviation adjusted or normalized by dividing by the mean value
- makes it possible to compare distributions of different ranges and means
- Degrees of Freedom
- number of observations minus one
- number of values that are free to vary independently
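The measures of spread above can be worked through on a small data set. This sketch assumes the sample form of the variance, which divides by the degrees of freedom (n - 1); the data values are hypothetical.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(values)
mean = sum(values) / n
value_range = max(values) - min(values)  # lowest to highest value

# Variance: sum of squared deviations divided by n - 1 degrees of freedom.
squared_devs = [(v - mean) ** 2 for v in values]
variance = sum(squared_devs) / (n - 1)
sd = math.sqrt(variance)   # standard deviation is the square root of the variance
cv = sd / mean             # coefficient of variation (dimensionless)

# Quartiles (25th, 50th, 75th percentiles) from the standard library.
quartiles = statistics.quantiles(values, n=4)

print(f"range={value_range}, variance={variance:.3f}, sd={sd:.3f}, cv={cv:.3f}")
print("quartiles:", quartiles)
```

Dividing by the mean makes the coefficient of variation unit-free, which is what allows distributions with different ranges and means to be compared.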
- COMPARING TWO OR MORE CHARACTERISTICS
- Nominal Data (grouped items)
- Basic Inquiry Statistics (refer to Fig Definitions of Symbols for Nominal Relationships)
- Experimental Event Rate in Exposed (ERR)
- ERR = A / (A+B)
- proportion of those with the risk factor who have or develop the outcome
- Control Event Rate in Unexposed (CRR)
- CRR = C / (C+D)
- proportion of those without the risk factor who have or develop the outcome
- Absolute Risk Reduction/Increase (ARR/ARI)
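- ARR = |ERR - CRR| (absolute value of the difference, so the result is positive whether risk is reduced or increased)
- way to appraise the reduction in risk compared to the baseline risk
- events avoided per 10,000 people
- Number Needed to Treat/Harm (NNT/NNH)
- NNT = 1/ARR = 1/|ERR - CRR|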
- number needed to treat to prevent one event
- Relative Risk Reduction (RRR)
- RRR = ARR/CRR
- can obtain ARR by multiplying RRR by the Control Event Rate (RRR * CRR)
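The basic inquiry statistics above can be computed from a single 2x2 table. In this sketch the cell labels follow the formulas in these notes (A, B = exposed with and without the outcome; C, D = unexposed with and without the outcome); the counts are hypothetical, and the absolute value in ARR is an assumption made so the result stays positive.

```python
# Hypothetical 2x2 table.
A, B = 15, 85   # exposed/treated: with outcome, without outcome
C, D = 25, 75   # unexposed/control: with outcome, without outcome

err = A / (A + B)        # experimental event rate
crr = C / (C + D)        # control event rate
arr = abs(err - crr)     # absolute risk reduction (or increase)
nnt = 1 / arr            # number needed to treat to prevent one event
rrr = arr / crr          # relative risk reduction

print(f"ERR={err:.2f} CRR={crr:.2f} ARR={arr:.2f} NNT={nnt:.1f} RRR={rrr:.2f}")
```

Here a 40% relative risk reduction corresponds to only a 10 percentage point absolute reduction, which is why both figures are worth reporting together.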
- Measures of Significance for Nominal Data (more Inquiry Statistics)
- Relative Risk ratio (RR) **
- ratio of the outcome with the risk factor/exposed and without the risk factor/not exposed
- RR = ERR/CRR = [A/(A+B)] / [C/(C+D)]
- investigator decides the number of subjects with and without risk factors
- from risk factor to outcome (inquiry statistics are forward in time)
- contrast to Odds Ratio
- so can be calculated only from a cohort or clinical trial
- cohort studies and clinical trials are also forward in time (from causes to outcomes), so RR is the appropriate measure
- persons with and without risk factor followed over time to determine which persons develop the outcome of interest
- Odds Ratio (OR) **
- (odds that a person with an adverse outcome was exposed or at risk prior to the outcome) divided by the (odds that a person without an adverse outcome was exposed or at risk)
- OR = (A/C)/(B/D) = AD/BC (see also fig Definitions of Symbols for Nominal Relationships)
- also called cross-product ratio
- investigator decides the number of subjects with and without disease outcome
- from outcome to risk factor (inquiry statistics are backward in time)
- contrast to Relative Risk
- Odds ratio usually used for case-control studies
- case-control study is also backward in time (from outcomes to causes) so appropriate measure
- logistic regression can also be interpreted in terms of odds ratio in addition to relative risk (see chapter 8)
- OR is non-linear, so exaggerates extremely high and low odds
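The relative risk and odds ratio formulas above can be checked against each other on one 2x2 table. The counts are hypothetical; the cell labels A, B, C, D follow the formulas in these notes (exposed with/without outcome, unexposed with/without outcome).

```python
# Hypothetical 2x2 table.
A, B = 15, 85   # exposed: with outcome, without outcome
C, D = 25, 75   # unexposed: with outcome, without outcome

# Relative risk: risk in the exposed divided by risk in the unexposed.
relative_risk = (A / (A + B)) / (C / (C + D))

# Odds ratio: odds of exposure among those with the outcome (A/C)
# divided by odds of exposure among those without it (B/D).
odds_ratio = (A / C) / (B / D)
cross_product = (A * D) / (B * C)   # the same quantity as a cross-product ratio

print(f"RR = {relative_risk:.3f}")
print(f"OR = {odds_ratio:.3f} (cross-product form: {cross_product:.3f})")
```

With these counts RR is 0.60 while OR is about 0.53, showing that the two measures are not interchangeable when the outcome is common.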
- matching measure statistic (RR, OR) to appropriate study type (Cohort, Trials, Case-control)
- each observational study type differs in both its Tense and Direction of Inquiry Statistics
- Tense:
- in the future=prospective
- in the past=retrospective
- Inquiry Statistics Direction:
- forward (risk to outcome)
- backward (outcome to risk)
- each measure statistic also differs in its Direction of Inquiry Statistics
- RR Relative Risk (forward in time)
- OR Odds Ratio (backward in time)
- matching appropriate statistic (see fig observational studies)
- case-control is retrospective-backward (use OR)
- cohort is prospective-forward (use RR)
- historical cohort is retrospective-forward (use RR)
- Ordinal Data (ranked items)
- Spearman rank correlation (rs = -1 to +1)
- for two ordinal variables, one ordinal and one numerical variable, or two numerical variables if the data is skewed
- rank order the data (a derived statistic) from lowest to highest by some characteristic
- ranks then used to calculate the statistic instead of the data
- +1 indicates perfect agreement, and -1 perfect reverse agreement, between the ranks (order) of the values, but not the values themselves
- tedious
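The ranking procedure above is tedious by hand but short in code. This is a minimal sketch that ranks each variable from lowest to highest (tied values share their average rank, an assumption the notes do not spell out) and then computes an ordinary correlation on the ranks. The function names and data are hypothetical.

```python
import math

def ranks(values):
    """Rank from lowest (1) to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Ordinary correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: correlation of the ranks, not the raw values."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # same order as x, but not a straight line
print(spearman(x, y))          # ranks agree perfectly
print(round(pearson(x, y), 3)) # raw values are not perfectly linear
```

The ranks agree perfectly even though the raw values do not fall on a straight line, which is the sense in which the statistic reflects order rather than the values themselves.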
- Numerical Data (raw numbers)
- Correlation Coefficient (r = -1 to +1)
- between two numerical variables
- independent of units
- greatly influenced by outlying (extreme) data values, so not good for skewed data
- use a transformation that changes the scale before the correlation is computed
- these transformations provide a weighted correlation
- rank or logarithmic transformations
- crude rule of thumb for interpreting correlation r (same for negative r's)
0.00 to 0.25 (little or no relationship)
0.25 to 0.50 (fair degree of relationship)
0.50 to 0.75 (moderate to good relationship)
0.75 or greater (very good to excellent relationship)
- correlations of r=0.95 to 1.00 are suspect and may indicate artifact or error in the biological sciences
- measures only a straight-line (linear) relationship, so it can miss a curvilinear relationship in which one variable changes faster or slower as the other changes
- "correlation does not imply causation"
- must justify by experimental observations or logical argument
- coefficient of determination (r^2)
- indicates the proportion of variation in one characteristic that is accounted for (predicted) by the other; this is shared variation, not shared causation
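Two of the points above, that r captures only straight-line relationships and that r squared gives the share of variation accounted for, can be illustrated with a short calculation. The helper function and both data sets are hypothetical.

```python
import math

def pearson(x, y):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Roughly linear data: r is fairly high.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson(x, y)
r_squared = r ** 2   # coefficient of determination

# Perfectly curvilinear data (y = x squared, symmetric about 0): r is zero
# even though y is completely determined by x.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_curve]
r_curve = pearson(x_curve, y_curve)

print(f"linear-ish: r = {r:.3f}, r^2 = {r_squared:.2f}")
print(f"curved:     r = {r_curve:.3f}")
```

A value of r near zero therefore rules out only a linear relationship, not a relationship of some other shape.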
- TABLES CAN GIVE MISLEADING PERCENTAGES - a common error
- how the data is presented and the scaling affects the interpretation
- when two or more measures are of interest, the purpose of the study generally determines which measure is viewed within the context of the other (which is dependent and independent).
- can imply a causality that is misleading if the data are normalized to the outcomes rather than to the risks.
- alters the percentage statistic relative to a different denominator, as part of a different whole
- difficult to keep clear: each number is a percent of which part, and in which causal direction?
- this example (Table 3-23 p 57) uses data on compliance and insurance status collected from a survey at one point in time. It shows that 35% of patients with a low level of compliance have no insurance (A. table), and 55% of patients with no insurance have low compliance (B. Table).
- shows whether the data is consistent with the interpretation, but doesn't prove causation. Here the results are calculated from a survey at one point in time, so causality is a confusing question. Especially when groups are subdivided, the ability to distinguish causality from coincidence is diminished. The table for a coincidence looks exactly the same as the table for a causal relationship.
- can also imply a causality that is false if the table is inverted to display, in the reverse direction, something that was causal only in the forward direction (C. Table).
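The effect of changing the denominator can be reproduced with a small table. The counts below are hypothetical, chosen only so that they yield the two percentages quoted above; they are not the counts from Table 3-23, and the function name is an assumption.

```python
# Hypothetical counts keyed by (compliance level, insurance status).
counts = {
    ("low", "none"): 77,  ("low", "insured"): 143,
    ("high", "none"): 63, ("high", "insured"): 217,
}

def share_within(cell, position):
    """Share of `cell` among all cells with the same label at `position`.

    position 0 normalizes within the compliance level,
    position 1 normalizes within the insurance status.
    """
    denominator = sum(n for key, n in counts.items() if key[position] == cell[position])
    return counts[cell] / denominator

cell = ("low", "none")
pct_of_low_compliance = share_within(cell, 0) * 100   # denominator: all low-compliance patients
pct_of_uninsured = share_within(cell, 1) * 100        # denominator: all uninsured patients

print(f"{pct_of_low_compliance:.0f}% of low-compliance patients have no insurance")
print(f"{pct_of_uninsured:.0f}% of uninsured patients have low compliance")
```

The same cell of 77 patients produces 35% or 55% depending only on which total it is divided by, so a percentage is interpretable only when its denominator is stated.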