www.DrTimDelivers.com

-----

**Inferential Methods**

A *fundamental goal in statistics is to make inferences (*__assertions__*) about a large population from a* __sample subset__, especially __using small sample sizes__, and in turn *whether the population parameters represented by the* __sample statistics reflect the typical individual.__ **Elements of subjective interpretation are always present in this process.**

TDB note: __An additional goal in medicine is whether the typical individual represents the patient.__ **Physicians practice clinical medicine (clinical-medical significance)**, where statistics is just one of the pieces used to diagnose and treat. "Evidence-based medicine", in contrast, presumes that all individuals are represented by the same sample statistic, without judgements about individual variation.

**Implication vs Inference [ANOTE]**

**Distinguish** *Uni-Directional* from __Bi-Directional__ causality

**Uni-Directional** causation is an *Implication Forward* or an *Inference Backward*: a *risk implies forward* **or** *an outcome infers backward* (one without requiring the other, not both).

**Bi-Directional** causation is an __Equality__: a *risk implies* **and** *an outcome infers, from both directions* (one requires the other).

- These causalities are often confused. The truth of an *inverse implication (inference = true)* is not equivalent to the forward implication *(implication = true)*. It is more likely true under some circumstances, but does not have to be, and this is a common fatal error of circular logic and decision making.

**Normality**

- **Jacob Bernoulli** and **Abraham de Moivre** first developed **approximations to the normal curve 300 years ago** (p3-5)
- Further development of the normal curve for statistics by **Laplace** (random sampling) and **Gauss** (normal curve assumption)
- Gauss assumed the mean would be more accurate than the median (1809)
- showed by implication that the observation measurements arise from a normal curve if the mean is most accurate
- used **circular reasoning**, though: there is no reason to **assume** that the mean optimally characterizes the observations; it was a convenient assumption in vogue at the time, since no way was clear to make progress without it
- the Gauss-Markov theorem addresses this

- Laplace's & Gauss's methods were slow to catch on
- Karl Pearson dubbed the bell curve "normal" because he believed it was the natural curve.
- Practical problems remain for methods which are based on the normal curve assumption

- Another path to the normal curve is through the **least squares principle** (p5-6)
- it does not provide a satisfactory justification for the normal curve, however
- although observations do not always follow a normal curve, from a mathematical point of view it is extremely convenient

- "Even under arbitrarily small departures from normality, *important discoveries are lost by assuming that observations follow a normal curve*." [p1-2]
- poorer detection of differences between groups and of important associations among variables of interest when not normal
- the magnitude of these differences can also be grossly underestimated when using a common strategy based on the normal curve
- new inferential methods and fast computers provide insight

**Pierre-Simon Laplace's method: the** __Central Limit Theorem__

- Prior to 1811, the only available framework for making inferences was the method of inverse probability (the Bayesian method). How and when a Bayesian point of view should be employed is still controversial.
- Laplace's method dominates today, and is based on the
**frequentist point of view**using confidence intervals for samples taken from the population (how often a confidence interval around a sample value will contain the true population value, rather than how often an interval around the true population value will contain the sample value). - Laplace's method provides reasonably accurate results under
*random sampling**,*provided the*number of observations is sufficiently large*,in the population.*without the need for an assumption of normality* - Laplace method is based on sampling theory. It assumes that the plots of
__observation means for samples__taken from the population have a__normal distribution__. It is a sampling theory. is an additional assumption often made, violation of which causes*homogeneity of variance**serious*problems
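The sampling-theory idea above is easy to check by simulation. A minimal sketch (my own illustration, not from the source; the exponential population, sample size 40, and seed are arbitrary choices): means of repeated random samples from a clearly non-normal population still cluster around the population mean with spread close to σ/√n, just as Laplace's method assumes.

```python
# Illustrative sketch: means of many random samples from a skewed,
# non-normal population (exponential, population mean 1.0) cluster
# around the population mean with spread near sigma / sqrt(n).
import random
import statistics

random.seed(1)

def draw_sample(n):
    return [random.expovariate(1.0) for _ in range(n)]

sample_means = [statistics.mean(draw_sample(40)) for _ in range(5000)]

grand_mean = statistics.mean(sample_means)   # close to population mean 1.0
spread = statistics.stdev(sample_means)      # close to 1 / sqrt(40) ~ 0.158
```

No normality was assumed for the population itself; only the plot of the sample means approaches a normal shape.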

**Seek a single value to represent the typical individual** from the distribution
- Area under a probability density function is always one
- Laplace distribution example (peaked, with a discontinuous slope at the maximum)
- many others

- Probability curves are never exactly symmetric
- symmetry is often a reasonable approximation
- *asymmetry is often a flag for poor representation by the mean*

**Finite sample breakdown point of a sample statistic**
- an outlier is an *unusually large or small value* for the outcome compared to the mean
- the breakdown point of a statistic __quantifies the effect of outliers__
- the breakdown point is the __proportion of outliers__ (number out of n points, see below) that can make the sample statistic arbitrarily large or small

__Mean__
- the mean is the *simple average of the sample values* (ybar): **ybar = SUM[y**_{i}**]/n**
- **the mean is highly affected by outliers**: extreme values (outliers) dominate the mean
- *a* __single outlier__ *can cause the mean to give a highly distorted value for the typical measurement*
- an arbitrarily small subpopulation can make the mean *arbitrarily small or large*
- has the smallest possible breakdown point (1/n): one observation, a single point, can dominate the mean; it is the statistic most affected by outliers, which is not desirable, and why the average is easily biased

**Weighted Mean**
- each observation is multiplied by a constant and averaged
- any weighted mean is also dominated by outliers
- sample breakdown point of the **weighted mean** (1/n) is the same as for the mean (if all weights are different from zero)

__Median__
- involves a type of "*trimming*", or ignoring contributions of some of the data
- **also orders** the observations (which invalidates many statistical tests because it violates random sampling)
- **eliminates** all highest and lowest values to find the middle one
- has the highest possible breakdown point (1/2): n/2 observations are needed to dominate the median, making it the statistic least affected by outliers (desirable)
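The breakdown-point contrast above can be made concrete. A small sketch (my own illustration; the data values and the outlier 1000.0 are arbitrary): corrupting a single value drags the mean far from the typical value while leaving the median unchanged.

```python
# Illustrative sketch: breakdown point 1/n (mean) vs 1/2 (median).
# One wild outlier drags the mean anywhere; the median barely moves.
import statistics

data = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.5, 3.8]
contaminated = data[:-1] + [1000.0]   # replace one value with an outlier

mean_shift = statistics.mean(contaminated) - statistics.mean(data)
median_shift = statistics.median(contaminated) - statistics.median(data)
```

Here `mean_shift` is roughly 100 while `median_shift` is zero: a single point out of ten dominates the mean, but n/2 points would be needed to dominate the median.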
__Variance__ (p20-22, 59, 246)

- A general desire in applied work is **measuring the dispersion or spread** of a collection of numbers which represent a distribution
- More than 100 measures of dispersion have been proposed
- **Variance** is one measure of the spread about the mean

- *Population Variance* = **σ**^{2} = __average squared deviation__ from the __population mean__ for a **single observation** (p245)
- **σ**^{2} = **SUM**[(population value - population mean)^{2} * (probability of value)] = expected value of the squared deviation
- **σ** = **SQRT**[**σ**^{2}] = __Population Standard Deviation__ of a **single** observation from the population mean

- convenient when working with the mean
- the Population Variance σ^{2} (for all in the population) is rarely known in practice, but can be __estimated by the Sample Variance S__^{2} of observations in a sample from that population (see below)

__Sample Variance__ = *S*^{2}
- the __average squared deviation__ from the __sample mean__ for a **single observation** (p21, see below)
- **S**^{2} = **SUM**[dev^{2}]/(n-1) = **SUM**([sample observation - sample mean]^{2})/(n-1)
- **S**^{2} = [(y_{1}-y_{bar})^{2} + ... + (y_{n}-y_{bar})^{2}] / (n-1)
- technically, distinguish the One Sample Variance from the Many Sample Variance, calculated as the squared deviation from the Grand Mean; for normal distributions, the Many Sample Variance is the sum of the single sample variances
- **n = # in sample**; the divisor is reduced to **(n-1)** to adjust for the lost (independent) degree of freedom
- **(y - ybar)** = deviation of a single observation in a sample from the sample mean
- **Sample Variance** here **refers to a Single Sample Variance (SSV)**; the combined variance for many samples is also called the Sample Variance, and one is an estimate of the other

**Breakdown point of the Sample Variance** is the __same__ as for the Sample Mean (1/n): **a single outlier can dominate the value of the Sample Variance S**^{2}. This **low breakdown point** of the Sample Variance is especially devastating for determining significance, even when observations are symmetrically distributed around some central value.

- *Variance of the Sample Mean* = **σ**^{2}/n
- approximates the __average squared deviation__ from the __population mean__ for **sample means** (p39, 71)
- the variance of the normal curve that approximates the plot of the sample means from the Population
- it is centered about the Population Mean, and depends critically on the independence of observations
- also called the *Squared Standard Error of the Sample Means* (SSEM) or *Mean Squared Error* (MSE)

- **Standard Error of the Sample Mean (SE or SEM)** = **SQRT**[**σ**^{2}/n]
- **square root of the Variance** of the sample mean; **n = # in sample**
- **measures** the __precision__ of the sample mean
- for normal distributions, also **measures** the __accuracy__ of the sample mean, the closeness of the *Sample Mean* to the *Population Mean* (makes the sum of the squared Standard Errors equal the square of the summed Standard Errors)
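The definitions above can be computed directly. A short sketch (my own illustration; the eight data values are arbitrary): S² with its (n-1) divisor, S, and the estimated standard error SQRT[S²/n].

```python
# Illustrative sketch: sample variance S^2 with the (n-1) divisor,
# sample standard deviation S, and standard error of the mean SE.
import math
import statistics

y = [4.0, 5.5, 3.5, 6.0, 5.0, 4.5, 6.5, 5.0]
n = len(y)
ybar = sum(y) / n

S2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # sample variance
S = math.sqrt(S2)                                  # sample standard deviation
SE = math.sqrt(S2 / n)                             # standard error of the mean

# the statistics module's variance() uses the same (n-1) divisor
assert math.isclose(S2, statistics.variance(y))
```

For these values ybar = 5.0, S² = 1.0, and SE = SQRT[1/8] ≈ 0.354, the precision attributed to the sample mean.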

**Estimating Population Parameters by Sample Statistics** (p59, also see ahead)

- __Population Mean__ (µ) is estimated by the **Sample Mean (ybar)** (also by the Grand Mean)
- __Population Variance__ (**σ**^{2}) is estimated by the **Sample Variance (S**^{2}**)** for an observation (single sample)
- __Population Standard Deviation__ (**σ** = **SQRT**[**σ**^{2}]) is estimated by **S** = **SQRT**[S^{2}], the square root of the sample variance
- __Population Variance of the Sample Mean__ (**σ**^{2}/n) is estimated by **S**^{2}/n, the **Single Sample Variance** divided by the number in the sample (the single Sample Estimate of the Variance)

- The terms mean, variance and standard deviation are confusingly used for
__both__*population parameters*and*sample estimates*. The distinction between a population parameter (which is a fixed number) and the sample statistic that estimates it (which is a function of the sample) should always be kept in mind. (Mandel p43)

**Estimates of statistical closeness of fit (precision)**(p22-24)

**Absolute Value Principle (Boscovich, 1700s)**
- calculate the error for the value that *minimizes the sum of the* **absolute value errors**
- **uses** the __median__ value: the sample median minimizes the sum of absolute errors

**Least Squares Principle (Legendre 1806)**
- calculate the error for the value that *minimizes the sum of the* **squared errors**
- **uses** the __mean__ value, which minimizes the sum of the squared errors
- Gauss had used this method earlier but did not publish it until 1809

**There are infinitely many** *other ways* to measure closeness in addition to these
- essentially arbitrary which measure is used, so some other criterion must be invoked
- absolute value to a power
- **M-estimators** of location (Ellis 1844); the sample mean is a special case of M-estimators
- many others
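The two principles above can be verified numerically. A sketch (my own illustration; the data and the grid of candidate values are arbitrary): a brute-force grid search over candidate "typical values" lands on the median for absolute errors and on the mean for squared errors.

```python
# Illustrative sketch: grid search showing the median minimizes the sum of
# absolute errors (Boscovich) and the mean minimizes the sum of squared
# errors (Legendre/Gauss).
import statistics

data = [1.0, 2.0, 2.0, 3.0, 10.0]
candidates = [i / 100 for i in range(0, 1201)]   # 0.00 .. 12.00

best_abs = min(candidates, key=lambda c: sum(abs(y - c) for y in data))
best_sq = min(candidates, key=lambda c: sum((y - c) ** 2 for y in data))
```

With the outlier 10.0 present, the two principles disagree sharply: `best_abs` is the median 2.0 while `best_sq` is the mean 3.6, previewing why the choice of closeness measure matters.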

**Fitting a Straight Line to Data** (**Principle of Regression**, p24-28)

**Background**

- any two points (x1,y1), (x2,y2) can be used to determine the slope and intercept of a line

y = intercept + slope*x, with slope = (y2-y1)/(x2-x1)
- overdetermined algebraic problem
- would get a different line for each pair chosen
- N points yield N(N-1)/2 pairs of points, each giving an estimate of the slope and intercept
- discrepancies are due in part to **measurement errors**

__Simple__ Linear Regression (*one* predictor, one outcome) (p28)

- measures only the **linear relationship** between the independent and dependent variable
- uses the discrepancy between a *proposed line* (__predicted__) and the *data* (__observed__)
- the discrepancy is called a *residual*

Residual = r = Y(observed) - Y(expected)

*Absolute Residual Method* (**Roger Boscovich, 1700s**)
- minimize the sum of **absolute residuals**
- equivalent to finding the Median
__Least Squares Residual Method__ (Legendre 1809) (p28-30)
- unclear if Gauss or Legendre was first to use the *least squares method*
- minimize the **sum of squared residuals**, instead of absolute residuals
- the **estimated slope turns out to be a** __weighted mean of the Y values__
- **equivalently, the estimated slope is also a** __weighted mean of all the slopes__ between all pairs of points
- when the slope is zero, the sample mean estimates the intercept
- emphasizes values farthest from the mean (a linear method)
- *finds the weighted mean from the infinitely many, which on average is the most accurate*
- represents the slope and intercept if __no measurement errors__ and __infinitely many points__
- weights are determined by the discrepancy of the X-values (predictors) from the mean X-value
- these weights sum to zero, not one
- *a single unusual point, properly placed, can cause the least squares estimate of the slope to be arbitrarily large or small* (breakdown point of 1/n, as for any weighted mean)

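A small sketch of both points above (my own illustration; the data, the corrupted value 100.0, and the exact line are arbitrary): the least squares slope computed as a weighted mean of the Y values, and its 1/n breakdown point, where one badly placed point wrecks the estimate.

```python
# Illustrative sketch: least squares slope as a weighted mean of the Y
# values, and its breakdown problem -- one corrupted point can make the
# estimated slope arbitrarily far from the truth.
def ls_slope(xs, ys):
    xbar = sum(xs) / len(xs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    # weights w_i = (x_i - xbar) / Sxx sum to zero, not one
    return sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 + 0.5 * x for x in xs]      # points on an exact line, slope 0.5

clean_slope = ls_slope(xs, ys)        # recovers 0.5
ys_bad = ys[:-1] + [100.0]            # a single corrupted Y value
bad_slope = ls_slope(xs, ys_bad)      # far from 0.5
```

The clean fit recovers the true slope exactly; corrupting one of six Y values moves the estimate to roughly 14, illustrating the breakdown point of 1/n.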

__Multiple__ Linear Regression (__two or more__ *predictors/risks*, one outcome)

- **predictor independent variables** can be __numerical or nominal__
- **outcome dependent variables** must be __numerical only__ and *cannot be nominal*
- measures only the __linear relationship__ between the independent and dependent variables
- __there is ambiguity about the term multiple regression__: some use the term for *multiple outcomes* rather than **multiple risks**, or even for **both multiple outcomes and risks**
- Gauss devised the *method of elimination* to handle multiple predictors

- Represent the shape of the population distribution by fitting a "normal" distribution to it
- the normal curve is the **exponential** of [minus the **squared deviation** from the population mean], divided by [twice the] *average squared deviation* (ie, *normalized by the variance*): f(y) = EXP[-(y - µ)^{2}/(2σ^{2})] / (σ·SQRT[2π])
- about 95% of observations are within 2 standard deviations, and 68% of observations are within one standard deviation, for a normal distribution
- **Mean** and **Standard Deviation** completely represent the distribution if it is __normal__; probabilities are determined exactly by the mean and standard deviation when observations follow a normal curve
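The coverage figures above follow directly from the normal CDF. A sketch (my own illustration, using the standard normal CDF built from `math.erf`): the 1-SD and 2-SD probabilities are fixed by the curve alone, with no data needed.

```python
# Illustrative sketch: under a normal curve, probabilities are fixed by the
# mean and standard deviation alone. The standard normal CDF (via math.erf)
# reproduces the one- and two-standard-deviation coverage quoted above.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

within_1sd = norm_cdf(1) - norm_cdf(-1)   # about 0.6827
within_2sd = norm_cdf(2) - norm_cdf(-2)   # about 0.9545
```

This is the precise sense in which the mean and standard deviation "completely represent" a normal distribution: every probability statement reduces to a calculation like this one.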

__Population Distribution__versus__Distributions of Samples taken from the Population__

__Population__ **Distribution**
- values for every member of the population
- summary parameters for population distribution (population mean and standard deviation as example)

__Sampling__Distribution

- repeated, (hopefully) random observations collected in a sample
- sample is subset from total population (values for some members of the population)
- size (number n) in each sample
- "primary" statistics for each sample distribution (sample mean, sample standard deviation and sample variance as examples)

**Precision**

- __inferences about the sampling distribution__ (Statistics): **how well the sample values represent the sample statistic** (ie, spread about the sample mean)
- can estimate precision from the sample data

__Accuracy__
- __Inferences about the population__ distribution from the sampling distribution (Statistics + Probability): **how well the sample statistic represents the population parameter** (spread of __single observations__ or __means__ from a sample of observations about the population mean), which depends on

- **randomness** of sampling
- **size** (number) in each sample (**n**)

- inferences using __non-random sampling__ from the population __may lead to__ *serious errors*

- estimating accuracy from the sample data requires subjective inference in addition to estimating precision

__Mean, Variance, Standard Deviation for Means for Sample Distributions__

__Sample Mean__ is a "primary" statistic from a __single Sampling distribution__

**Sample Variance** for observations from a __single sample__ is **S**^{2}; **Sample Standard Deviation** for observations from a __single sample__ is **S** (square root of the Sample Variance)

__Mean of Sample Means__ (Grand Mean) is a "secondary" or summary statistic derived from the distribution of the Sample Means (a statistic based on a collection of statistics)

- Deviations of
__sample means__from the__average of sample means__form a gaussian (normal) distribution if the samples are random

__Standard Error of the Sample Mean__ (SEM, SE, or Mean Squared Error) is the **Standard Deviation for the** *Means of Samples*, relative to a single sample mean (or the average of all sample means, the grand mean) used as an approximation to the population mean

**Standard Error of the Sample Mean (SE) can be estimated from S** as **SQRT**[S^{2}/n]. This indicates *how close (related to the percentage of all sample means) any one Sample Mean* __approximates the average Sample Mean__, which both __approximate the population mean__ (statistic + inference)

*S*^{2} *is the Standard Deviation Squared (SD*^{2}*) for a* __sample observation__ (p21); **(n-1)** is the divisor to calculate S^{2} for __Single Observations__, from the squared deviations (residuals)

**S**^{2} also estimates the **Variance of one observation about the Population Mean**

*S*^{2}/n *is the Standard Error Squared (SE*^{2}*), the Standard Deviation Squared for a* **mean of sample observations** (approximate, since S approximates *σ*) (p39); **(n)** is the divisor to calculate SE^{2} from S^{2} for **Sample Means**

**S**^{2}/n also estimates the **Variance of Sample Means about the Population Mean**

**Median Absolute Deviation (MAD) Statistic**

- Want measures of **location** (representative value) and **scale** (spread around that value) for a distribution which are not themselves affected by outliers
- __Median__ is one *alternative statistic to the mean* (middle value)
- MAD = median of the [absolute values of the (deviations from the median)] (computed from the *median of the deviations*; see Median)
- **MAD/0.6745 estimates the population standard deviation for a normal probability curve**
- MAD is *less accurate* than the sample standard deviation S (computed from the *mean of the squared deviations*) in estimating the population standard deviation **σ** for a __normal__ distribution
- **Masking**: both the __sample mean and sample standard deviation are inflated by__ *outliers*; increasing an outlier value also increases the mean and standard deviation for that sample, masking the ability to detect outliers
- __MAD__ **is much** __less affected by outliers__ (sample breakdown point of 0.5, the highest possible), so it is **good for detecting outliers**

- both the
**Outlier Detection using MAD**: declare X an outlier when **|X-Median| > 3*MAD** (an approximation to **|X-Median| > 2.965*MAD**)

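The MAD rule above is short enough to run directly. A sketch (my own illustration; the data and the planted outlier 25.0 are arbitrary): the outlier inflates the mean and standard deviation, but the MAD-based rule still flags it cleanly.

```python
# Illustrative sketch: the MAD-based outlier rule from the notes,
# |X - median| > 2.965 * MAD (roughly 3 * MAD), on a small sample
# with one planted outlier.
import statistics

data = [6.1, 5.9, 6.3, 6.0, 5.8, 6.2, 6.1, 25.0]
med = statistics.median(data)
mad = statistics.median(abs(x - med) for x in data)

outliers = [x for x in data if abs(x - med) > 2.965 * mad]
sigma_hat = mad / 0.6745   # MAD estimate of sigma for a normal curve
```

Because the MAD has breakdown point 0.5, the planted value 25.0 barely perturbs `mad` itself, which is what keeps the detection rule from being masked.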

**Central Limit Theorem** **for samples containing** *large numbers* (**due to Laplace**, p39)

**Normal Curve**

- **plots of sample** *means* approximately follow a normal curve, provided each mean is based on a reasonably large sample size
- this normal curve of sample means would be *centered about the population mean*
- **the spread in values obtained from** __one sampling__ is called the __sample variance__
- **the variance of the normal curve that approximates the plot of the means from each sample** (variation of the means of samples, or SSEM) **is estimated by** (**σ**^{2}/n), defined using the __population variance__ (**σ**^{2}) and the __number of observations in a sample__ (n) used to compute each mean
- this goes in inverse order from inference (uses the population to estimate the sample)

- distinguish the SSEM of *many samples* from the Variance of *one sample*
- non-normality of the parent distribution affects significance tests for differences of means
- *there is no theorem to give a precise size for "reasonably large"*

- Graphs of (left) __Normal Distribution__ and (right) *Medians (used here for illustration; similar to a plot from Means)* (dots) of small-sized Samples from a Normal Distribution, with the Expected Normal Distribution from the mean and standard deviation (solid), for sample sizes of 20 (fig 3.2 p34, fig 3.11 p45)

**[ISLT]** curve characteristics *relative to the normal curve*

- **non-normality** (*departure from standard bell shape*, *small effect on mean*): if deviations of dependent values (Y's)
- **bias** (*skew from symmetry*, *large effect on mean*): if asymmetry about the mean, one tail is higher than the other at the same distance from the mean
- **falloff** (*tails at extreme values*, *large effect on mean* __and__ variance): if convergence is slower than the normal distribution
- summary statistics from non-normal distributions depend both on the __values__ in the distribution plus the __rate of change in values__ (ie, second or higher derivatives, or amount of skew compared to the normal curve)

__Uniform__ **distribution example** (p40)
- constant value from 0 to 1 (all values equally likely between 0 and 1; a step function)
- population mean for this curve is 0.5
- population variance is 1/12
- *light tailed* and *symmetric*
- the central limit theorem predicts that the plot of random small-sample means is approximately normal, centered around the 0.5 mean, with variance 1/(12n)
- a plot of only *twenty means of samples gives reasonably good results* (p40 example)
- multiple random samples of twenty values are used to predict the mean
- the distribution of these small-sample **means is skewed to slightly higher values**

- Graphs of (left)
__Uniform Distribution__and (right)__Means (dots)of Small-sized Samples from Uniform Distribution__with Expected Normal Distribution from mean and standard deviation (solid) (fig 3.5 p40, fig 3.6 p 41)
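The uniform example above can be reproduced numerically. A sketch (my own illustration; the seed and the 4000 replications are arbitrary choices): means of samples of 20 from Uniform(0, 1) center near 0.5 with variance near 1/(12·20), as the central limit theorem predicts.

```python
# Illustrative sketch of the uniform example: means of random samples of
# n = 20 values from Uniform(0, 1) center near 0.5 with variance near
# 1/(12 * n), matching the central limit theorem's prediction.
import random
import statistics

random.seed(2)
n = 20
means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(4000)]

center = statistics.mean(means)             # near 0.5
var_of_means = statistics.variance(means)   # near 1/240
```

Even with only 20 values per sample, the light tails and symmetry of the uniform curve make the normal approximation quite good.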

__Exponential__ **distribution example** (p40)
- density exp(-x) (or the mirrored exp(x))
- population mean is 1
- population variance is 1 (so the variance of the sample means is 1/n)
- *light tailed* (small samples approximate the normal curve well) and *asymmetric*
- a plot of means from samples of only 20 gives reasonably good results assuming the central limit theorem (random small samples of means approximate a normal curve)
- __the plot of means of small samples (n=20) is skewed to slightly lower values__ **for exp(-|x|), compared to large sample means (which follow a normal curve)**
- **[ISLT]** *means of small samples from an exp(|x|) distribution* will be skewed to *higher* values
- exp(-|x|) is skewed to *smaller* values, more so than for a uniform distribution (see p38-40)

- exp(-|x|) skewed to
- Confidence Intervals (CI) of small samples from exp(|x|) are decreased
- Confidence Intervals (CI) of small samples from exp(-|x|) are increased


- Graphs of (left)
__Exponential Distribution__and (right)__Means (dots) of Small-sized Samples from Exponential Distribution__with (left) Expected Normal Distribution from mean and standard deviation [solid] (fig 3.7 p41, fig 3.8 p42)
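The residual skew in the exponential example can also be seen by simulation. A sketch (my own illustration; seed and replication count are arbitrary): means of samples of 20 from exp(-x) still lean right, so slightly more than half of the sample means land below the population mean of 1.0.

```python
# Illustrative sketch of the exponential example: means of samples of
# n = 20 from exp(-x) retain a slight right skew, so somewhat more than
# half of the sample means fall below the population mean (1.0).
import random
import statistics

random.seed(3)
n = 20
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(4000)]

center = statistics.mean(means)                          # near 1.0
share_below = sum(m < 1.0 for m in means) / len(means)   # a bit above 0.5
```

The sampling distribution is still centered on the population mean; the asymmetry shows up only in how the means are distributed around it.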

**Lognormal** Distribution Example (p76)

- *light tailed* and highly *asymmetric*
- the mean and median are not too far apart since it is light-tailed, but problems begin, especially with Variance estimates (see the heavy-tailed "logheavy" below); the T-distribution from a lognormal begins to have problems because of this

- Graphs of (left)
__Lognormal Distribution__and (right)__Means of Small-sized Samples from Lognormal Distribution__(green) with Expected Normal Distribution from mean and standard deviation (blue) [from http://www.gummy-stuff.org/normal_log-normal.htm]

**"__Logheavy__" distribution example** (p42)
- mean and median are very different (contrast with uniform and exponential, where mean and median are close)
- *heavy tailed* and highly *asymmetric*
- outliers are more common
- the mean is *not* near the median

- the distribution of sample means is *poorly approximated* by the mean and standard deviation of the sample means (normal approximation) for the "Logheavy" distribution
- **requires larger numbers** for the central limit theorem normal approximation to represent it, approaching the population mean as the sample size increases
- **[ISLT]** accuracy of small samples for the "logheavy" distribution: **skewing (asymmetry relative to central tendency)**, **departure from linearity**, and **falloff towards infinity** all require larger numbers for the normal approximation to be accurate (p44)
- high curvature decreases accuracy
- slower falloff towards infinity *compared to the normal distribution (outliers) decreases accuracy*
- the median is different from the mean
- **the plot of means of small samples here is skewed to lower values, between the population median and the sample median,** compared to large samples
- __however, the sample median still estimates the population__ *median*, even for skewed distributions, since the **small-sample mean does not estimate the population** __mean__ for skewed distributions
- __mean and median are far apart__ (the population mean is far from most of the observations)
- faster drop and higher peak below the mean, compensated by slower drop and lower peak above the mean (see fig 3.10)
- contrast to the lognormal distribution (light tailed)
- Graphs of (left) __"Logheavy" Distribution__ and (right) __Means of Small-sized Samples from "Logheavy" Distribution__ (rippled) with Expected Normal Distribution from the mean and standard deviation (solid) (fig 3.9, fig 3.10 p43)

*** __"Logheavy" Distribution__ is __much better approximated if one uses the__ **Medians** of Small-sized Samples to approximate the population median, instead of the **Means** to approximate the population mean (see graph next and p46)
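The median-vs-mean point above can be sketched in code. The book's "logheavy" curve is not reproduced here; as my own stand-in I use a heavy-tailed lognormal with sigma = 2 (seed and replication counts are also arbitrary): sample medians of small samples track the population median far more tightly than sample means track the population mean.

```python
# Illustrative sketch: for a heavy-tailed, asymmetric population
# (lognormal, mu=0, sigma=2, standing in for "logheavy"), the sample
# median estimates its target far more reliably than the sample mean.
import math
import random
import statistics

random.seed(4)
pop_median = 1.0              # median of lognormal(mu=0, sigma=2)
pop_mean = math.exp(2.0)      # mean = exp(sigma^2 / 2), about 7.39

def sample(n):
    return [random.lognormvariate(0.0, 2.0) for _ in range(n)]

n = 20
med_err = statistics.mean(abs(statistics.median(sample(n)) - pop_median)
                          for _ in range(2000))
mean_err = statistics.mean(abs(statistics.mean(sample(n)) - pop_mean)
                           for _ in range(2000))
```

The average error of the median stays under 1 while the mean's error is several times the population median, matching the note that with heavy tails the medians of small samples are the better summary.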

**Problems with Samples from Non-Normal Distributions: the Mean and the Median**

**Tails (Light, Heavy)**

**Tails are how quickly the probability curve drops off** from the mean value towards an infinite predictor (X)
- **Uniform** and **Exponential** distributions are *light tailed* (they drop off quickly towards infinity compared to the normal curve; outliers tend to be rare). However, the **Uniform Distribution is** *symmetric* and the **Exponential Distribution is** *asymmetric*.
- the **Logheavy** distribution is **heavy tailed** (drops off slowly towards infinity relative to the mean; outliers are common) and **asymmetric**

**[ANOTE]**

**I create terms for the distribution characteristic of "**__monotonicity__**"** (no change in the *sign* of the slope over the distribution, ie no wiggles in the distribution curve) and for the "__logheavy__" distribution, to be considered with the properties of __symmetry__ (reflection about the mean) and __tails__ (falloff to large/small values) of a distribution

*Uniform* distribution is ............. **light-tailed, symmetric, monotonic** (Fig 3.5 p40)
*Normal* distribution is .............. **light-tailed, symmetric, non-monotonic** (Fig 3.2 p34)
*Exponential* distribution is ..... **light-tailed, asymmetric, monotonic** (Fig 3.7 p41)
*Lognormal* distribution is ........ **light-tailed, asymmetric, non-monotonic** (Fig 5.3 p76)
*"Logheavy"* distribution is ... **heavy-tailed, asymmetric, non-monotonic** (Fig 3.9 p43)

**[ISLT]** *the combination of these three characteristics*, __tails__, __symmetry__ and __monotonicity__, relative to the normal distribution curve greatly influences statistical precision and accuracy. __Asymmetric, non-monotonic__ distributions appear to have unreliable means and/or variances for small samples.

**Accuracy of sample mean and median** (__how close the statistic is to the parameter__; p44-45, fig 5.5 and fig 5.6 p79-81)

- for *symmetric* probability curves

- __medians and means__ of small samples **center around the** __mean/median__ of a symmetric population
- the normal curve is symmetric about the mean

- for *asymmetric* probability curves

- __means__ of small samples are closer to the population median, but __slowly converge to the asymmetric population mean as sample size increases__, requiring a larger sample size to estimate the population mean (p42)
- larger samples are needed using means than using medians
- this is due to outliers (one outlier can completely dominate the mean)
- __medians__ of small samples center around the asymmetric population __median__
- the sample median is __separated from the asymmetric population mean__ (in general the sample median is __not__ equal to the population mean; see the "logheavy" distribution example)
- __sample medians are a better approximation for the asymmetric population median__ than sample means are for the asymmetric population mean with __small samples__

- for *light tailed* probability curves

- **symmetric light-tailed** curve: **plots of sample means are approximately normal** even when based on only 20 values in a sample (p42); the sample mean is close to the median and a good approximation to the population values
- **asymmetric light-tailed** curve: **can have poor probability coverage** for the confidence interval, poor control over the probability of a Type I error, and a biased Student's T

- for *heavy tailed* probability curves

- **symmetric**: **[ISLT]** the sample mean is close to the median and a *good* approximation to the population *values*; however, the estimate of the population *variance* will be *inaccurate* for heavy-tailed symmetric curves
- **asymmetric**: __sample__ *medians* provide a better approximation for the population *median* **than do sample** *means* for the population *mean*
- plots of sample means converge much more slowly to the mean for heavy- than for light-tailed asymmetric distributions (p45)


__Errors are always introduced when a sample statistic is used (eg, the mean for__ *some* individuals from the population) to estimate a population parameter (the mean for *all* individuals in the population)

- can calculate the mean squared error of the sample means (**average squared difference**)
- the average or mean of the squared [differences between the infinitely many means of samples (sample means) and the population mean]
- called the **"expected squared difference"** (expected denotes average)
- want the __mean squared error__ to be as small as possible
- if the mean squared error for the sample mean is small, it does not imply that the standard deviation for a single observation will also be small
- the __variance of the means of samples__ is called the __squared standard error__ of the sample mean
- Gauss and Laplace made early contributions to estimating these errors

- distinguish the sample mean deviation __relative to the__ __mean of sample means__ from that __relative to the__ __mean of the population__
- sample deviations vs population deviations
- sometimes these are the same, depending on the distributions, and the terms are often used interchangeably:

**standard deviation of the mean for a sample**

**standard error of the sample mean (standard deviation of the mean of all the sample means)**

**standard deviation of the mean for a population**

**Weighted Means**

*Random Samples*, __LARGE SIZE__ **(Laplace - Central Limit Theorem)**
- under general conditions, the __central limit theorem applies to a wide range of weighted means__
- as the number of observations increases, and if the experiment were repeated billions of times, one would get fairly good agreement between the plot of weighted means and the normal curve (can use the mean and standard deviation to estimate the curve)
- __under random sampling and large numbers__, the most accurate estimate *of the population mean*, based on the average squared distance from the population mean, *is the usual* __weighted sample mean__ in which *each observation gets the same weight (1/n)*
- **Assumptions**
- **assumes** __random sampling__
- **assumes** __sample sizes are large__ enough that the plot of the means of the samples would follow a normal curve
- does __not assume__ that the samples were from a normal population curve
- does __not assume__ symmetry in the population curve

**SMALL or LARGE SIZE,***Random Samples***(Gauss - General Sample Size)**- derived similar results under weaker conditions than Laplace,
__without__resorting to the central value theorem requiring large samples __under random sampling__*the*optimal*weighted**mean for estimating the population mean is the**usual sample mean*(as for Laplace formulation).*of all the linear combinations of the observations (*__weighted means__*) we might consider,**the*__sample mean is most accurate__*under relatively unrestricted conditions that allow the probability curve to be non-normal, regardless of sample size.***Assumptions**__random sampling only__- does
__not assume__*large numbers*in sample - does
samples were from a__not assume__*normal population*curve - does
__not assume__*symmetry*

- does
- used the
__rule of expected values__

- there are
*problems with this approach for some distributions*

- derived similar results under weaker conditions than Laplace,

**Median and other Classes of Estimators**

- these summary statistics are *outside the class of weighted means*
- the __sample median is sometimes more accurate__ than the sample mean
- requires putting the observations in ascending order, as well as weighting the observations

**Median vs Mean**

- if the probability curve is *symmetric*, with *random sampling*, one can find a __more accurate estimate__ of the population mean than the sample mean by looking __outside the class of weighted means__

**Mean**

- the sample mean is *based on all the values*
- nothing beats the mean under normality
- the tails of the plotted sample means are closer to the central value than for the median, so the sample mean is the more accurate representation of the population mean
- [ANOTE] good for normal/symmetric distributions, otherwise an unreliable statistic

**Median**

- **not a weighted mean**; can be slightly or greatly more accurate in certain situations
- the sample median is *based on the two middle values*, with the rest of the values having zero weight
- the median is a better estimate of the population mean for Laplace distributions (sharply peaked)
- (better for non-normal, symmetric distributions)
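The mean-vs-median trade-off above can be checked by simulation: compare mean squared errors for a normal population and for a sharply peaked Laplace (double exponential) population. The sample size, trial count, and distributions are my own illustrative choices:

```python
import random
import statistics

random.seed(2)

def mse(estimator, sampler, true_center=0.0, n=20, trials=20_000):
    """Mean squared error of a center estimator over repeated samples."""
    return statistics.mean(
        (estimator([sampler() for _ in range(n)]) - true_center) ** 2
        for _ in range(trials)
    )

def normal():
    return random.gauss(0.0, 1.0)

def laplace():  # double exponential: exponential magnitude with a random sign
    x = random.expovariate(1.0)
    return x if random.random() < 0.5 else -x

# Under normality the mean wins; for the sharply peaked Laplace the median wins
print(mse(statistics.mean, normal), mse(statistics.median, normal))
print(mse(statistics.mean, laplace), mse(statistics.median, laplace))
```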
**Regression Curve Fitting (Gauss-Markov Theorem)** (p55)

**Simple regression** (*one predictor* variable X, *one outcome* variable Y)

- the least squares estimator of the slope/intercept of a regression line is optimal among the class of weighted means that minimize the expected squared error (expected denotes average)
- does not rule out other measures of central tendency
- **"*homo*scedastic"** refers to a __constant variance__ (I recommend the term *convaried*): the population standard deviation (spread in outcome Y) is constant, *independent of predictor variable* (risk) X
- Gauss showed that *if the variance is constant, then the Least Squares estimator of the slope and intercept is optimal among all the weighted means*
- **"*hetero*scedastic"** refers to a __non-constant variance__ (I recommend the term *non-convaried*): the population standard deviation, or variation in Y values, *changes with predictor* X
- Gauss showed that if the variance of the Y values corresponding to any X were known, then optimal weights for estimating the slope could be determined, and derived this result for __multiple predictors as well__
- many analysts used the unweighted least squares estimator, assuming a constant variance for the population; however, in some situations, **using optimal weighting can result in an estimate that is hundreds, even thousands, of times more accurate than the unweighted one**
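A sketch of this weighting idea, under assumptions of my own (a linear model where the spread of Y grows with X and the variance at each X is known): weight each observation by the inverse of its variance, and compare the accuracy of the slope estimate against the unweighted fit over repeated experiments.

```python
import random

random.seed(3)

def fitted_slope(xs, ys, ws):
    """Slope of a weighted least squares line (weights = 1/variance)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    return sxy / sxx

xs = [x / 10 for x in range(1, 101)]
sigmas = [0.1 + x for x in xs]      # heteroscedastic: spread grows with X
true_slope = 2.0

ols_sqerr = wls_sqerr = 0.0
for _ in range(500):                # repeat the experiment many times
    ys = [1.0 + true_slope * x + random.gauss(0.0, s) for x, s in zip(xs, sigmas)]
    ols = fitted_slope(xs, ys, [1.0] * len(xs))                 # unweighted
    wls = fitted_slope(xs, ys, [1.0 / s ** 2 for s in sigmas])  # optimal weights
    ols_sqerr += (ols - true_slope) ** 2
    wls_sqerr += (wls - true_slope) ** 2

print(wls_sqerr / ols_sqerr)  # well below 1: weighting is markedly more accurate
```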

__Estimating unknown population parameters__ **(making the chain of inference)**

**Each individual is a member of progressively larger (more inclusive) groups**

- an individual sample contains a subset of the individuals in the population
- a sample of individuals drawn at random creates a kind of envelope or shape for the distribution which groups them together (sample curve)
- one individual can belong to many different sample groups of the population distribution (different sample curves drawn from the population curve)
- an individual in the population is a __member of the population__ as well as a __member of sample groups__
- Progression for inference: **individual, sample, finite group of samples, infinite group of samples, population**
- An extended chain of **inferences** is required for validity: __individual__ *to* __sample__ *to* __finite number of samples__ *to* __infinite number of samples__ *to* __population distribution__ *to* __individual in the population distribution__ (or __mean of sample__ *to* __mean of finite samples__ *to* __mean of infinite samples__ *to* __mean of population distribution__ *to* __individual in the population distribution__)
- a __P__arameter reflects some characteristic of a __p__opulation of subjects; a __S__tatistic reflects some characteristic of a __s__ample from that population (primary, or secondary summary)
- want to approximate the probability that the population parameter (and thus the individual) is reflected by the sample statistic: a **summary statistic** from the sample is used to estimate the corresponding population **summary parameter**
- when we use a sample of individuals from the population to estimate the population parameter (for all individuals), we always make an error
- requires the central limit theorem: a __large number N in each sample__ (and a large *number of samples*)

*P*_{s} = **sample estimate** of a population parameter

*SE(P*_{s}) is the **standard error of P**_{s} (for __number N in each sample__) **= σ/SQRT(N) ~ S/SQRT(N).** Contrast the standard error of the mean from a single sample with the standard error of the mean in the limit of an infinite number of samples

*CI(P*_{s}, 95%) = 95% Confidence Interval = [P_{s}-1.96*SE(P_{s})] to [P_{s}+1.96*SE(P_{s})], which **has an approximate** *95% probability of containing the unknown population parameter*

__Estimating the Population Mean__ (p58)

**Laplace General Method**

- **requires** __random sampling__, which means that all observations in a sample are independent of risks and outcomes
- the **plot of infinitely many** __sample means__ is __assumed to__ **follow a normal curve.** this is reasonable for:
  - **large numbers** in each sample (use the central limit theorem)
  - a **normal distribution** in the population
- __find an expression for the variance__ of infinitely many sample means: *make a chain of inferences*
  - find the mean of *one* sample
  - estimates the mean of *many* samples
  - estimates the mean of an infinite number of samples
  - estimates the population mean

__Standard Deviation__ *******

- *confusing terminology* (implied by the context): __sample mean__ refers to the average from **one** sample, but is also used for the average of the means from **many** samples (average from a single sample vs average across multiple samples)
- **SD (Standard Deviation)** can refer to the spread in any collection of **individual observations**: __for a sample__, __for a population__, and also __within any grouping__
- the **standard deviation** for a **sample of *individuals*** from the population (**S, using the sample mean**) **approximates the** *population standard deviation σ*; divided by SQRT(n), it **approximates the** __standard deviation of the sample means__ about the grand mean

**Confidence Interval** (Laplace 1814, p58)

**Confidence Interval for a Population Parameter P**_{s}

- **CI(P**_{s}, 95%) = [P_{s}-1.96*SE(P_{s}) to P_{s}+1.96*SE(P_{s})]
- **P**_{s} = **sample estimate** of a population parameter
- **N = number in each sample**
- **SE(P**_{s}) is the standard error of P_{s}
- has an approximate **95% probability of containing the unknown population parameter**
- this method assumes homoscedasticity: *if there is* __heteroscedasticity__*, the confidence interval can be extremely inaccurate*, even under normality

**Confidence Interval for the** __population mean (µ)__

- assumes a **normal distribution** in the population and **large numbers N** in each sample
  - population mean = (**µ**)
  - population variance = (σ^{2})
  - sample mean variance = (**σ**^{2}**/ N**), the squared deviations from **µ** summed
  - assumes the sample variance (**S**) approximates the population variance (**large numbers** in samples, invoke the central limit theorem)
- __CI(µ, 95% population mean)__: **the interval +/-1.96 * [SE] = +/-1.96 * [σ/SQRT(N)] has a** __95% probability of containing the unknown population *mean*__. **This is an important point, for the confidence interval limits concern only the *means* of populations from which the samples are taken.** The confidence limits are *not* bounds on a proportion of *individuals* from the population, and the projection to any *individual* requires inferences and judgements *beyond* the statistical analysis. In essence the population distribution must be known to do this.
- for **large** enough samples, the sample approximation **+/-1.96 * [S/SQRT(N)]** has an approximate __95% probability of containing the unknown population mean__
- the 95% confidence interval (CI) is used routinely for convenience; it could be any other percentage
- Laplace routinely used **3** rather than __1.96__
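A minimal sketch of the large-sample interval, with hypothetical data (and recall from the note above that it bounds the population *mean*, not individuals):

```python
import math
import statistics

# Hypothetical sample of N individual measurements
sample = [47.2, 51.8, 49.5, 53.1, 48.7, 50.4, 52.9, 46.8, 50.1, 49.3,
          51.2, 48.1, 52.4, 50.8, 47.9, 49.9, 51.5, 48.4, 50.6, 49.0]

N = len(sample)
mean = statistics.mean(sample)
S = statistics.stdev(sample)      # sample S approximates the population sigma
SE = S / math.sqrt(N)             # standard error of the sample mean

# Large-sample 95% CI for the population mean: mean +/- 1.96 * SE
lo, hi = mean - 1.96 * SE, mean + 1.96 * SE
print(f"mean = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

For a sample this small, Student's T (covered later) would widen the interval slightly; 1.96 is the large-sample normal value.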
**Confidence Interval (CI) for the *slope*** __of a regression line__

**Linear Regression is a** __Weighted Mean__ (p46)

- repeat an experiment infinitely many times with N sampling points in each experiment
- the **Least Squares estimate of the slope of a regression line to the experimental (Y) results can be viewed as a weighted mean of the outcome (Y) values** (see p28-30)

**[ANOTE]**

1) the __average of all values__ vs the __average of the means of all samples of values__ is the same because averaging is linear, (a+b)+c = a+b+c, if samples are random and numerous

2) a normal distribution also makes the __errors linear__

- Laplace made a convenient assumption and hoped that it yielded reasonably accurate results
  - assume __large numbers__
  - assume __constant variance__ (homogeneous, or homoscedasticity, an **important assumption**)
- **if a reasonably large number of pairs of observations is used (Laplace), then we get a good approximation of the plotted slopes under fairly general conditions**

**[ANOTE]** an experiment implies either a __controlled selection__ or a __random selection__ from all the possible selections, with __identification and control of all variables__ that may affect the result (are causative), for each observation, group of observations, or some used function of the observations. defined selections give results for a defined subgroup; random selections give results for the average of a random subgroup. these may or may not be equivalent.

**[ISLT]** it's not the parent distribution, **but the distribution of errors (for the mean) and squared errors (for the variance) that has to be linear and homoscedastic**

**METHOD for Least Squares** (p55-62)

**DEFINITIONS** for Least Squares

__y=b+ax__ **(theoretical regression line for the population we are trying to estimate)**

- **b** is the intercept of the regression line for an infinite number of points
- **a** is the slope of the regression line for an infinite number of points
- **n** is the number of data points
- **S**_{y}^{2} = **sample variance** of all n of the sample y_{i} values about the sample mean
- **y**_{i} **= b**_{i} **+ a**_{i}**x**_{i} **(line fit for the *ith* sample data point, i=1 to n)**
- **(x**_{i}**,y**_{i}) is the ith data point
- **x**_{i} is the mean predictor value for the ith group
- **y**_{i} is the mean outcome value for the ith group

__y=d+cx__ (regression line arrived at from that sample)

- **d** is the least squares regression **estimate of the intercept b for the n data points**
- **c** is the least squares regression **estimate of the slope a for the n data points**

**ASSUMPTIONS** for Least Squares

- assume **independence of outcome and predictor**
- assume a **constant variance** in the population (homoscedasticity)
  - determine that the data do not contradict this assumption
  - estimate this assumed common variance (per Gauss's suggestion)
  - this assumption sometimes masks an association detectable by a method that permits heteroscedasticity
**RECIPE for the Variance of the slope = σ**_{c}^{2} for outcome results Y from Least Squares (p64)

- **Var(Y) = S**_{y}^{2} estimates the assumed __common Variance__ of each group
  - compute the __Least Squares estimate of the slope__ (a) and intercept (b) for the n points; compute the corresponding n residuals r_{n} of the computed line from the data
  - square each of the n residuals
  - sum the n results
  - divide by the number of data pairs of observations minus two (**n-2**)

**Result for SLOPE VARIANCE for Least Squares (case of slope = c)**

- **Var(c)** = **σ**_{c}^{2} = the __squared standard error__ of the least squares estimate __of the slope c__ (p64)
- computed using the common variance of the Y (outcome) values **S**_{y}^{2} and the sample variance of the X (predictor) values **S**_{x}^{2}
- **Var(c)** = **S**_{y}^{2} **/ [(n-1)*S**_{x}^{2}**]** ("*squared* **standard error of the least squares estimate of the slope**")
- **σ**_{c} **= SQRT[Var(c)] = S**_{y} **/ SQRT[(n-1)*S**_{x}^{2}**]** ("__standard error of the slope__")
- [ANOTE] the squared error is *linear* for a normal curve
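The recipe above can be coded directly; the (x, y) data pairs below are hypothetical:

```python
import math
import statistics

def slope_and_se(xs, ys):
    """Least squares slope c and its standard error, following the recipe above."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    c = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx  # slope
    d = ybar - c * xbar                                             # intercept
    # residuals: square, sum, divide by n - 2 -> common variance estimate S_y^2
    s_y2 = sum((y - (d + c * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    s_x2 = sxx / (n - 1)             # sample variance of the X (predictor) values
    var_c = s_y2 / ((n - 1) * s_x2)  # squared standard error of the slope
    return c, math.sqrt(var_c)

# Hypothetical data pairs with a slope near 2
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.3, 11.9, 14.2, 15.8]
c, se_c = slope_and_se(xs, ys)
print(c, se_c)
```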

- these methods were developed over the last forty years
- it was originally thought that standard methods of inference were insensitive to violations of assumptions; **it is more accurate to say that** *these methods perform reasonably well in terms of Type I errors*, or false positive findings of statistical significance, when performing regression:
  - when groups have __identical probability curves (shapes)__
  - when __predictor variables are independent__
- that is, they __can detect if two distribution curves are not identical__, but we don't know if it is because of a difference in means, a difference in distribution shapes, a difference in data errors, a difference in some other characteristic of the distributions, or processing errors introduced into the distribution. And any comparison of summary data loses precision.
  - due to sampling differences
  - due to population differences
  - does this negate the ability to say that they are the same, or that they are not the same?
- Even **worse problems** arise if the regression variables are **correlated**. (p67)

**Hypothesis Testing**

**dichotomization method**

- hypothesis testing __dichotomizes the risk into a yes/no__ (for risk present/not) and the __outcome into a yes/no__ **(for test result correct/incorrect).** it is based on choosing one __predictor variable__ (risk) **and requires choosing** __null__ *and* __alternative__ distributions. there are also two error levels and two means for normal distributions, so it also requires selecting an __alpha error level__ and a __beta error level__ to dichotomize into 2 of the 6 parameters.
- choose *null* (H_{o}) and *alternative* (H_{a}) hypotheses to provide a **binary variable** (if __null__, then __not alternative__). since two parameters determine the assumed gaussian shape for each hypothesis distribution, picking two parameters forces a binary yes/no for comparing the means. *there is much potential confusion* because of the use of multiple negations, and **whether values are described relative to the null or relative to the alternative hypothesis** (either description is *equivalent*).

**DICHOTOMIZATION** allows for **4 possible Test/Reality Situations**, often shown as a 2x2 table (see A):

Reality(R) is True , Test(H) is Positive (**True Positive,** correct assertion)

Reality(R) is False, Test(H) is Negative (**True Negative,** correct assertion)

Reality(R) is False, Test(H) is Positive (**False Positive,** Type I alpha error)

Reality(R) is True , Test(H) is Negative (**False Negative,** Type II beta error)

The actual Reality is not usually known, of course, which is the purpose of the test.

- There are equivalent ways to display the same hypothesis relationship, so careful attention is demanded to how the items are actually arranged. For example, the 2x2 table [A] can be equivalently diagrammed by reversing Reality and Hypothesis, row with column [B].

A(left), B(right)

- Still more ways to diagram the findings exist in terms of the alternative hypothesis, rather than the null hypothesis:

C(left),D(right)

- The R
_{0}& H_{0}can be interchanged within a row or column as well, and all the previous can be recast (shown here only for A as E):

E

**It is important to discuss the two statistical errors in hypothesis testing (mucho confusing terminology)**

**Type I error** (see graphic of Hypothesis Testing below)

- __alpha error, or "false finding"__ of a difference from the expected null (here alpha is used interchangeably for the cutoff level and the probability in the null tail)
- false discard/rejection of the null assumption, a false positive
- *discard the null and accept the incorrect alternative* hypothesis, or __erroneously not retain the correct null__ **hypothesis** (hypothesis testing implies __two__ conditions, not just one)
- acceptance of the alternative caused by chance when the populations are actually the same: __populations seem different but are not__ (a Type I error)
- the distribution for the null determines the probability of an alpha error for a given alpha cutoff
- choosing an alpha cutoff level for the null distribution __also fixes beta__, the chance of erroneously *rejecting the true alternative* hypothesis (for a given alternative distribution and a given null mean; the alpha cutoff for the null is the beta cutoff for the alternative)

**Significance (Null Probability, related to alpha error)**

- __significance is how often the test is correct when there is NO real effect__: a *true non-finding*, correct retention/acceptance of the null, a true negative for a difference from the expected null, *determined by the alpha cutoff and the null distribution*
- significance is (**1-alpha**), the *chance to correctly retain the true null hypothesis*
- 95% null significance level (**1-alpha**) for a 5% Type I error level (**alpha**)

**Type II error** (see graphic of Hypothesis Testing below)

- __beta error, or "false ignoring"__ of a difference from the expected null (here beta is used interchangeably for the __cutoff level__ = X_{c} and the __probability in the alternative tail__ = function of X_{c})
- false retention/acceptance of the null, a false negative
- __retain the null and reject the correct alternative__ hypothesis, or __erroneously not discard the incorrect null__ hypothesis (hypothesis testing implies __two__ conditions, not just one)
- retention of the null caused by chance when the populations are actually different: __populations seem the same but are not__ (a Type II error)
- the distribution for the alternative determines the probability of a beta error for a given cutoff
- choosing a beta cutoff level for the alternative distribution __also fixes alpha__, the chance of **erroneously** *rejecting the true null* hypothesis (for a given null distribution and alternative mean; the beta cutoff for the alternative is the alpha cutoff for the null)

**Power (Alternative Probability, related to beta error)**

- __power is how often the test is correct when there IS a real effect__: a *true finding*, correct rejection/non-retention of the null, a true positive for a difference from the expected null, **determined by the beta cutoff and the alternative distribution**
- power is (**1-beta**), the *chance to* **correctly accept the true alternative hypothesis**
- 95% alternative power level (**1-beta**) for a 5% Type II error level (**beta**)
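The interplay of alpha, beta, and power can be made concrete with Python's standard library; the null/alternative means and the one-sided 5% cutoff here are illustrative assumptions of mine:

```python
from statistics import NormalDist

# Hypothetical sampling distributions of the test statistic
null = NormalDist(mu=0.0, sigma=1.0)   # populations the same (no effect)
alt = NormalDist(mu=2.5, sigma=1.0)    # populations differ (real effect)

# Choosing the alpha cutoff on the null distribution (one-sided 5% test)...
x_cut = null.inv_cdf(0.95)             # about 1.645

alpha = 1 - null.cdf(x_cut)            # Type I error: null tail beyond the cutoff
beta = alt.cdf(x_cut)                  # ...also fixes beta: alternative tail below it
power = 1 - beta                       # chance of a true finding

print(f"cutoff = {x_cut:.3f}, alpha = {alpha:.3f}, beta = {beta:.3f}, power = {power:.3f}")
```

Changing the alternative mean changes beta at a fixed alpha, and moving the cutoff trades alpha against beta, as in the graphic illustration section below.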

**[IMPORTANT ANOTE]**

- Technically, the null and/or alternative can be negated, which recasts the expected null as false and the alternative as true, or whatever combination of the negations. These possibilities make craziness when following the logic of hypothesis testing. I have cast the issue here in one way only, for clarity and sanity. But perhaps research results would be much clearer if a standard approach were adopted, given how one has to pay attention to this nitty-gritty minutiae.
- **If the observations are *correlated***, then the alpha and beta probabilities are highly inaccurate (not discussed here). this violates the random selection assumption of summary statistics and is an important factor biasing results.

**Graphic Illustration of Hypothesis Testing Errors**

Shows the 6 parameters that must be limited to 2

**top graph** error probabilities [alpha1, beta1]

**middle graph** error probabilities [alpha1, beta2]: same alpha/beta cutoff, __different__ alternate mean, same alpha & different beta probabilities

**lower graph** error probabilities [alpha2, beta2]: __different__ alpha/beta cutoff, same alternate mean, __different__ alpha & beta probabilities


**Behavior of alpha significance, beta power**

- error levels are always a compromise for given distributions and means
- smaller __alpha__ (fewer false positives) gives bigger __beta__ (more false negatives), because they are not independent for fixed distributions
- a __smaller sample size__ for a normal distribution gives *lower power* = (1-beta)
- a __smaller standard deviation__ (variance of a distribution) gives *higher power* = (1-beta)

__Z Test__ **for significance with** *large sample size* approximating a normal population distribution

- **sample assumed from a normal curve**
- the population **standard deviation is known**
- **y**_{mean} is the sample mean
- **µ is the population mean**
- **Z =** [**y**_{mean} - µ] **/** [**σ/SQRT(N)**] has a standard normal distribution
- can rule out a __specific__ chosen value of the population mean µ; i.e., probabilities are based on the mean of the alternative distribution
- Neyman, Pearson, and Fisher developed this (early 1900s)
- *ok for* __large sample size N__ if the population distribution approximates a normal curve
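A minimal sketch of the Z test; the null mean, the known σ, and the sample values are all hypothetical:

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical null: population mean 100 with a KNOWN sigma of 15
mu, sigma = 100.0, 15.0

# Hypothetical sample of N observations
sample = [104.1, 98.2, 110.5, 95.3, 107.8, 102.6, 99.9, 112.3, 96.7, 105.4,
          108.1, 101.2, 94.8, 109.6, 103.3, 97.5, 106.9, 100.8, 111.4, 98.9]
N = len(sample)
y_mean = statistics.mean(sample)

# Z = [y_mean - mu] / [sigma / SQRT(N)] has a standard normal distribution
z = (y_mean - mu) / (sigma / math.sqrt(N))
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"Z = {z:.3f}, two-sided p = {p_two_sided:.3f}")
```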

__One-Sample Student's T test__ **used for significance with** *small samples and non-normal population distributions*

*Laplace* T distribution (**1814,** p72-74) *for large sample size n*

- **Laplace T = (y**_{mean} - µ) **/ (S/SQRT[n])**
- **y**_{mean} is the sample mean
- **µ is the population mean, which is known**
- problem if **µ** is not known

**Large sample size**

- Laplace estimated the *population* standard deviation __σ__ with the *sample* standard deviation __S__
- assumed the distribution of sample means has a **normal** distribution
  - the difference between the population mean and the sample mean, divided by the estimated standard error of the sample mean, is **normal,** with **mean=0** and **variance=1**
- by the central limit theorem, for **large sample size (large N)**, T has a standard normal distribution with reasonably accurate probability coverage

**Small sample size**

- the Laplace T distribution is non-normal for small sample size N, even when sampled from a normal curve. **using** __Z probabilities does not provide a good estimate of the probabilities for the T distribution with small sample size__ (see Colton p128)
- **when sampling from a light-tailed curve**, the probability curve for the sample means __is__ **approximately normal** *if the population distribution is not skewed* (p74)
- if the distribution is skewed or heavy-tailed, large sample sizes are required

*Student's* T Distribution **(William Gosset 1908)**

- **T = (y**_{i} **- y**_{bar}**) / (S/sqrt(n))**
- **y**_{i} = individual sample mean value
- **y**_{bar} = (1/n)(y_{1} **+ y**_{2} **+ ... + y**_{n}**) = mean of all individual samples (Grand Mean of Sample Means)**
- **S**^{2} **= [1/(n-1)]*[(y**_{1} **- y**_{bar}**)**^{2} **+ ... + (y**_{n} **- y**_{bar}**)**^{2}**] (Sample Variance for one sample)**
- **SE = S/sqrt(n)**, where **S = SQRT(S**^{2})
- note: uses **y**_{bar} for **µ**
- an __extension__ of the Laplace T method, derived as an **approximation for the probability curve associated with T**
- Ronald Fisher gave a more formal derivation
- the __probability depends on sample size__
- **assumes** __normality and random sampling__
- although it is bell-shaped and centered about zero, T does __not__ belong to the family of normal curves
  - T has a standard deviation larger than 1, compared to the normal (Z) distribution, which has a mean of zero and a standard deviation of 1
  - T and Z are both symmetric (see Dawson 2001, Basic and Clinical Statistics, p99)
  - there are infinitely many bell curves that are not normal
- for **large sample sizes,** the probability curve associated with the T values becomes **indistinguishable from the normal curve**
  - most computer programs always calculate T instead of Z, even for large sample size
  - T~Z for a **sample size of 5** or larger
    - *for Tc<1.7,* T < Z (T underestimates the probability with 5 samples)
    - *for Tc>1.7,* T > Z (T overestimates the probability with 5 samples)

**[ANOTE]** for a sample size of 5, calculating the Type I error (for alpha < .05) using Tc *overestimates* the area beyond Tc compared to the normal curve with Zc = Tc.
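The heavier tails of T at small sample sizes can be seen directly by simulation (samples of size 5 from a normal population; the simulation parameters are my own):

```python
import math
import random
import statistics

random.seed(5)

def t_stat(sample, mu):
    """One-sample T against a known population mean mu."""
    n = len(sample)
    return (statistics.mean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))

# Many samples of size 5 from a standard normal population
ts = [t_stat([random.gauss(0.0, 1.0) for _ in range(5)], 0.0)
      for _ in range(50_000)]

spread = statistics.stdev(ts)                    # noticeably larger than 1
tail = sum(abs(t) > 1.96 for t in ts) / len(ts)  # well above the normal 5%
print(spread, tail)
```

Using the Z cutoff (1.96) with n = 5 therefore understates the true tail probability, as the notes warn.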

**Practical Problems with Student's T**

*serious inaccuracies can result for some nonnormal distributions*

- in general, the sample mean and sample variance are __dependent__ for T; that is, the sample variance changes depending on the sample mean
  - the sample average for T depends on the *value* from the population distribution and the *slopes around that value*
  - for the normal curve, the mean and variance are __independent__
- Gosset's Student T can give exact confidence intervals and an exact alpha Type I error for a __normal__ distribution
- when sampling from a __lognormal__ distribution (an example of a **skewed, light-tailed** one), the actual probability curve for T differs
  - the **plot of the T distribution is skewed (longer tail)** toward smaller T in this case, and not symmetric about the mean, compared to sampling from a normal curve with sample size **twenty** (Fig 3.9, 5.5)
  - the **mean of the T distribution is shifted to smaller T (skewed to the longer-tailed side as well)** compared to the population distribution mean, because the population mean for the lognormal is shifted higher; that is, **T is biased toward smaller values** (although the median is not shifted in the same manner)

- the __alpha error__ **can be inflated above the assumed level (underestimated).** sometimes there is a __higher probability of rejecting when nothing is going on__ than when a difference actually exists
- Student T is then __biased__
  - the __Expected Value E__ for the population of all individuals: **E[(Y-µ)**^{2}**] = σ**^{2}
  - the __Expected Value of the Sample Mean__ Y_{bar} is the population mean µ: [E(Y_{bar} - µ) = 0]
  - it is __assumed that T is symmetric about zero__ **(the expected value of T is zero, or the distribution of Y is symmetric about the mean)** (p80)
    - the expected value of T must be zero if the mean and variance are independent, as under normality
    - under *non*-normality, it does not necessarily follow that the mean of T is zero
  - __for a skewed distribution__, **estimating the unknown variance with the sample variance in Student's T needs larger sample sizes** to get accurate results for the mean and variance, even when outliers are rare (light-tailed); the sample size was increased from 20 to 200 for this example
  - __the actual probability of a Type I (alpha) error can be substantially higher than 0.05__ at the 0.05 level
    - occurs __when the probability curve for T differs substantially from the curve assuming normality__
  - **[ISLT]** this is a problem when comparing distributions that are **skewed differently.** *Bootstrap techniques* **can estimate the probability curve for T to identify problem situations (important newer techniques)**
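A minimal sketch of the bootstrap-t idea mentioned above, under my own assumptions (a small lognormal-style sample): resample the data to estimate the actual, possibly asymmetric, probability curve of T rather than assuming symmetry about zero.

```python
import math
import random
import statistics

random.seed(7)

def t_stat(sample, center):
    n = len(sample)
    return (statistics.mean(sample) - center) / (statistics.stdev(sample) / math.sqrt(n))

# Hypothetical skewed sample (lognormal-like)
data = [math.exp(random.gauss(0.0, 1.0)) for _ in range(20)]
m = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))

# Bootstrap-t: resample with replacement, centering each T* on the observed mean
t_star = sorted(t_stat(random.choices(data, k=len(data)), m)
                for _ in range(5_000))

# Empirical 2.5% and 97.5% points of T* (typically NOT symmetric for skewed data)
lo_t, hi_t = t_star[124], t_star[4874]

# Bootstrap-t 95% CI for the population mean, using the estimated T curve
ci = (m - hi_t * se, m - lo_t * se)
print(ci)
```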
- __Power__ **can also be affected with Student's T** (p82)
  - the *sample variance is more sensitive to outliers* than the sample mean for Student's T distribution
  - even though outliers inflate the sample mean, they *can inflate the sample variance more*
  - this widens the confidence intervals for the mean of the distribution and prevents finding an effect (rejecting the null) when comparing different distributions, lowering power
  - **[ISLT]** this is more of a problem for power when the T distribution is *skewed* (the distribution of the means of samples from the parent distribution), rather than when it is nonnormal and symmetric, or when the parent distribution is skewed
    - evaluate the symmetry of the residuals for all sample means to check this
    - implications of skewed vs heavy-tailed vs nonmonotonic (see ahead)
- **Transforming the data** can improve the T approximation
  - simple transformations sometimes correct serious problems with controlling Type I (alpha) error
  - a typical strategy is to **take logarithms** of the observations and **apply Student's T to the results**
  - simple transformations can fail to give satisfactory results in terms of achieving high power and relatively short confidence intervals (beta)
  - other, less obvious methods can be relatively more effective in these cases; sometimes there is a __higher probability of rejecting when nothing is going on__ than when a difference actually exists
- __Yuen's Method for the difference of trimmed means__ (from Ch9)
  - gives slightly better control of the alpha error for large samples
  - h is the number of observations left after trimming
  - d_{1} = (n_{1}-1)S_{w1}^{2} / [h_{1}(h_{1}-1)]
  - d_{2} = (n_{2}-1)S_{w2}^{2} / [h_{2}(h_{2}-1)]
  - W = [(y_{meant1} - y_{meant2}) - (µ_{t1} - µ_{t2})] / SQRT(d_{1}+d_{2})
  - adding the __Bootstrap Method__ to Yuen's Method __may give better control of the alpha error for small samples__
- __Welch's Test__ is another method that is generally more accurate than Student's T
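The d_{1}, d_{2}, and W formulas above can be sketched directly. The 20% trimming proportion and the data are my own illustrative choices; under the null hypothesis µ_{t1} = µ_{t2}, so that term drops out:

```python
import math
import statistics

def trimmed_pieces(xs, prop=0.2):
    """(trimmed mean, winsorized variance S_w^2, n, h) for one group."""
    xs = sorted(xs)
    n = len(xs)
    g = int(prop * n)                 # observations trimmed from each tail
    h = n - 2 * g                     # observations left after trimming
    t_mean = statistics.mean(xs[g:n - g])
    # winsorize: pull each trimmed tail in to the nearest retained value
    wins = [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g
    return t_mean, statistics.variance(wins), n, h

def yuen_w(group1, group2):
    """Yuen's W for the difference of trimmed means (null: equal trimmed means)."""
    m1, sw1, n1, h1 = trimmed_pieces(group1)
    m2, sw2, n2, h2 = trimmed_pieces(group2)
    d1 = (n1 - 1) * sw1 / (h1 * (h1 - 1))
    d2 = (n2 - 1) * sw2 / (h2 * (h2 - 1))
    return (m1 - m2) / math.sqrt(d1 + d2)

# Hypothetical groups: the second is shifted up, each with one wild outlier
g1 = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7, 3.9, 19.0]
g2 = [6.8, 7.4, 6.1, 7.9, 6.5, 7.2, 6.9, 7.6, 6.3, 21.0]
print(yuen_w(g1, g2))
```

Trimming keeps the outliers from swamping the variance estimate, which is the point of the method.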

__Two-Sample__ Case for Means for significance with small samples (p82-83)

Use the **Hypothesis of Equal Means** to obtain the **Confidence Interval for the Difference of Means**

- the __difference between sample means__ estimates the __difference between the corresponding population means__ that the samples are taken from

**LARGE** Sample Size

__Weighted Two-Sample Difference__ **= W**

- assume __large sample size__ **(Laplace)**
- assume __random sampling__
- __no__ assumption about __population variances__
- **n**_{1}, n_{2} are the corresponding sample sizes
- the population variance for the weighted difference is **Var(y**_{bar1} **- y**_{bar2}) = (σ_{1})^{2}/n_{1} **+ (σ**_{2})^{2}/n_{2}
- estimate this population variance with the sample variances (**S**_{1}, S_{2})
  - **S**^{2} **= (S**_{1})^{2}/n_{1} + (S_{2})^{2}/n_{2}
  - **SE =** SQRT**[S**^{2}**] =** SQRT**[(S**_{1})^{2}/n_{1} + (S_{2})^{2}/n_{2}]
- use the probabilities for the normal curve at 95%

**W = (y**_{bar1} - y_{bar2}) / SE

or

**W = (y**_{bar1} - y_{bar2}) / SQRT**[(S**_{1})^{2}/n_{1} + (S_{2})^{2}/n_{2}]

- **|W| > 1.96 = W**_{.95} is the 95% confidence level to reject the hypothesis of equal means (from the normal distribution)
- **CI**_{.95} = (W_{.95}) * **SE = (+/-1.96) *** SQRT**[(S**_{1})^{2}/n_{1} + (S_{2})^{2}/n_{2}]
- **get a reasonably accurate confidence interval by the central limit theorem, and good control over a Type I (alpha) error** (**better than Two-Sample T**) **if sample sizes are sufficiently large**
- **W** gives somewhat better results for **unequal population variances**, although this is a *serious* problem, especially for power

**SMALL** Sample Size or **Non-Normal** distribution

**T** = __Two-Sample Difference T__

- assume __random sampling__
- **additionally** assume __equal Population Variances__
- **n**_{1}, n_{2} are the corresponding sample sizes
- For the difference between sample means of two groups

**T = (y**_{bar1}-y_{bar2}) / SQRT**[S**_{p}^{2}(1/n_{1}+1/n_{2})]

where the assumed common (pooled) variance is

**S**_{p}^{2} **= [(n**_{1}-1)S_{1}^{2}+(n_{2}-1)S_{2}^{2}] / [n_{1}+n_{2}-2]

**the hypothesis of equal means is rejected if |T| > t** (the cutoff t is a function of the degrees of freedom **df** and the confidence level, and is obtained from tables based on Student's T distribution)

**CI**_{.95} **= (+/-t**_{.95(df)})***SE = (+/-t**_{.95(df)})***SQRT**{S_{p}^{2}(1/n_{1}+1/n_{2})}

*W and T are equivalent for S*_{1} *= S*_{2} *(equal sample variances) or n*_{1} *= n*_{2} *(equal sample sizes)*
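The pooled computation can be sketched the same way; `pooled_two_sample_t` is a hypothetical helper name, and the samples are illustrative. Run on equal-sized samples it reproduces the W value, matching the equivalence noted just above.

```python
import math

def pooled_two_sample_t(y1, y2):
    """Student's two-sample T with pooled variance.
    Assumes random sampling and equal population variances."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    s1sq = sum((y - m1) ** 2 for y in y1) / (n1 - 1)
    s2sq = sum((y - m2) ** 2 for y in y2) / (n2 - 1)
    # pooled ("assumed common") variance S_p^2
    sp_sq = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se
    df = n1 + n2 - 2          # degrees of freedom for the t table lookup
    return t, df
```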

__Student Two-Sample T__ **performs well under violations of assumptions**

__Student Two-Sample T__ can substantially __avoid inflation of Type I (alpha) errors__ **if** the probability distributions

- are __normal__ or have the __same shape__ (ie, don't have differential skewness)
- have the __same population variance__
- have the __same sample sizes__

*then acceptable even for* __smaller samples__

- Reliability of summary statistics for the **Two-Sample T distribution** is affected by the *shape* and *variance* (**of population or sample**) and the *size* (**for samples**). Unequal variances or differences in skewness can greatly affect the ability to detect true differences between means.

**1)** __same shape__

- __equal population variances__
  - alpha (Type 1) error should be __fairly accurate__
  - normal or nonnormal identical-shape distributions
  - same or different sample size
- __unequal population variances__
  - *same sample size with normal* distributions
    - **sample size >8**: alpha (Type I) error is __fairly accurate__, no matter how unequal the variances
    - **sample size <8**: alpha (Type I) error __can exceed 7.5%__ at the 5% level
  - *same sample size with nonnormal* distributions: alpha (Type 1) accuracy can be __very poor__
  - *different sample size* with normal or nonnormal distributions: alpha (Type 1) accuracy can be __very poor__

**2)** __different shape__

- alpha (Type 1) accuracy can be __very poor__ for
  - **nonnormal and nonidentical** distributions
  - **equal and unequal population variance**
  - **same and different samples of any size**
**Student T - Major Limitations**

**Student's T test is** __not reliable with unequal population variances__

**Student's T test can have** __very poor power__ (**beta or false ignoring** of an effect, larger confidence intervals)

- exacerbated __with unequal population variances__
- any method based on means can have low power
- the expected value of one-sample T can differ from zero, and power can decrease as the effect gets greater for the one-sample T test (the test is biased)
  - **[ANOTE]** power decreases (T<1.7) but begins to go back up (T>1.7)

**Student's T test does not control the** __probability of a Type 1 error__ (**alpha or false finding**) for unequal population variances

- __unequal distribution shapes__, __unequal variance__, __unequal sample size__ can cause problems, in that order
- __unequal variances__ combined with __unequal sample sizes__ **can be a disaster for alpha error**; problems can occur **even with equal sample sizes or normal distributions**
- some argue that this is not a practical issue

**Difference between means may not represent the typical difference for two-sample T**, because it is the difference of inaccurate estimates of the population means. **W (Weighted T)** helps avoid this compared to **T**.

**Transforming the distributions can improve variance properties**

- logarithm
- sample median (not a simple transformation, in that some observations are given zero weight)
- Rasmussen (1989) reviewed this and found that *low power* due to outliers is still a problem

__When reject the null with Two-Sample Student T__ (find a **difference**), it is an indication that the __probability distributions differ__ *in some manner* (the __distributions are not identical__) (p87-89)

*if the population curves differ in some manner __other__ than means, then the magnitude of any difference in the __means__ becomes suspect (see limitations)*

**population difference**

- unequal means
- unequal variances
- difference in skew (shape)
- difference in other measure

**sampling difference**

- unequal sample size
- non-randomization

**[ANOTE]** a consideration is whether samples with __equal sample variances__ would be likely to have __unequal population variances__, and thus create a problem using the Student T. If a study factor changes the population distribution (as evidenced by a variance difference), Student T would have problems for the difference of means.

**The difference between any two variables having identical skewness is a symmetric distribution. With a symmetric distribution,**__Type I__significance errors are controlled reasonably well. But if the distributions differ in skewness, problems arise. Even under symmetry however, power might be a serious problem when comparing means.

**[ISLT]** it is not so much that the population distributions themselves are skewed, but also factors which affect the derivatives of the distributions in different ways, so that the difference curve is not symmetric (that is, if skewing is not the same in both populations).

- a *normal curve* has symmetric (linear) derivatives at any point along the curve
- *light-tailed curves* would tend to have less of this problem (samples tend to be close to the mean, so symmetry is less of a problem)
- *monotonic curves* would tend to have less of this problem (step, linear, exponential; samples tend to be symmetric about the mean)
- *heavy-tailed skewed curves* have biased derivatives, so samples tend to favor one side of the mean; heavy tails also affect the variances, even for symmetric curves
- different population distributions would compare differently biased means when determining differences (if the mean is biased the same way in each group, taking the difference gives an accurate measure of the difference of means)

- Among the most robust techniques of all for means and variances are the newer **Bootstrap Techniques** (a very important area not discussed here).

*"There has been an apparent erosion in the lines of communication between mathematical statisticians and applied researchers, and these issues remain largely unknown. It is difficult even for statisticians to keep up. Quick explanations of modern methods are difficult. Some of the methods are not intuitive based on the standard training most applied researchers receive."*

**Issues about the sample mean (nonnormality)**

*nonnormality can result in* __very low power__ and __poor assessment of effect size__

- cannot find an estimator that is always optimal
- the problem gets worse when trying to measure the association between variables via least squares regression and Pearson's correlation

*differences between probability curves other than the mean can affect conventional hypothesis testing between means*

- population mean and population variance are **not robust**; they are sensitive to very small changes for any probability curve
- this affects Student's T, and its generalization to multiple groups using the so-called ANOVA F-test
- the variance of the mean is smaller than the variance of the median for the normal curve, but the variance of the mean is larger than the variance of the median for mixed-normal distributions, even for a very slight departure from normality

**George Box** (1954) and colleagues

- sampling **from normal distributions, unequal variances has no serious impact on the probability of a Type I (alpha) error**
- **if the ratio of variances is less than 3, Type I (alpha) errors are generally controlled**
  - restrict the ratio of the standard deviation of the largest group to the smallest to at most sqrt(3) ~ 1.7
- if this ratio gets larger, practical problems emerge

**Gene Glass** (1972) and colleagues, and subsequent researchers

- indicate problems for unequal variances in the ability to control errors
- if groups **differ in terms of both variances and skewness, get unsatisfactory power properties** (can't detect effects)

**H. Keselman**(1998)


**Summary of Factors** affecting the probabilities for the mean and variance

*nonnormality and mixed normality*

- small departures from normality can **inflate the population variance** tremendously

*heteroscedasticity*, or changes in variance with the predictor (σ^{2} ratio > 3, σ ratio > 1.7)

- non-constant variance
- unless randomly affected rather than systematic

*skew* in the population distribution

- asymmetry (very serious)

*heavy-tailed population*

- outliers dominate and can inflate the population variance tremendously

*differences between probability curve distributions*

- __trimmed mean__ and __M-estimator__ can improve these problems

**M-estimators**

- for symmetric curves can give fairly accurate results
- under even slight departures from symmetric curves, the method breaks down, especially for small sample sizes
- this also applies to the two-sample case
- the percentile bootstrap method performs better than the percentile t method in combination with M-estimators for fewer than twenty observations
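The sensitivity of the mean to mixed normality described above can be seen in a small simulation. Everything here is illustrative and my own (the contamination fraction, seed, sample size, and helper names): samples are drawn from a contaminated normal, and the sample mean varies far more across repeated "studies" than a 20% trimmed mean does.

```python
import random
import statistics

def trimmed_mean(xs, gamma=0.2):
    """gamma-trimmed mean: drop g = floor(gamma * n) values from each tail."""
    xs = sorted(xs)
    g = int(gamma * len(xs))
    return statistics.fmean(xs[g:len(xs) - g])

def mixed_normal(n, rng, eps=0.1, wide_sd=10.0):
    """Contaminated normal: N(0,1) with probability 1-eps, N(0, wide_sd) otherwise."""
    return [rng.gauss(0, wide_sd if rng.random() < eps else 1.0) for _ in range(n)]

# Repeat the "study" many times and compare how much each estimator varies.
rng = random.Random(1)
means, tmeans = [], []
for _ in range(2000):
    sample = mixed_normal(25, rng)
    means.append(statistics.fmean(sample))
    tmeans.append(trimmed_mean(sample))

var_mean = statistics.pvariance(means)    # inflated by the heavy tail
var_tmean = statistics.pvariance(tmeans)  # much smaller: trimming removes the tail
```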

__Percent Trimmed Mean__ [p143-149]

**General Trimmed Mean**

- advantages of the 20% trimmed mean for the very common situation of a __heavy-tailed__ probability curve
  - __sample variance is inflated for heavy-tailed__ distributions compared to samples from a light-tailed distribution
  - __trimmed mean improves accuracy and precision for the mean__
  - tends to be substantially **closer to the central value**
- __trimmed mean reduces the sample variance__
  - *more likely to get high power and relatively short confidence intervals if use a trimmed mean*
  - __symmetric__ vs __skewed__ **distributions does not affect this**

- discards less accurate data that contaminates the population estimates
- for a
**normal curve**or one with**light tails**, the sample mean is more accurate than the trimmed mean, but not substantially - the middle values among a random sample of observations are much more likely to be close to the center of the
**normal curve**. - for the sample means, the extreme values hurt more than they help with even small departures from normality
- by trimming, in effect remove the heavy tails that bias the variance and thus the mean
- trimmed mean is **not a weighted mean** and not covered by the Gauss-Markov theorem, since it involves ordering the observations
  - when extreme values are removed, the remaining observations are *dependent*

- **breakdown point** is **0.2 for the 20% trimmed mean**
  - the minimum proportion of outliers required to make the 20% trimmed mean arbitrarily large or small is 0.2 (20%)
  - compare to the breakdown point of **1/n for the sample mean** and **0.5 (50%) for the median**
  - arguments have been made that a *sample breakdown point <0.1 is unwise*
  - so the sample mean is awful relative to outliers

- for symmetric probability curves, the mean, median and trimmed mean of the population are all identical
- for a skewed one, all three generally differ
  - **when distributions are skewed,** the median and 20% trimmed mean are argued to be better measures of what is typical (see Fig 8.3 p148)

- modern tools for characterizing the sensitivity of a parameter to small perturbations in a probability curve (since 1960)
- qualitative robustness
- infinitesimal robustness
- quantitative robustness (breakdown point of the sample mean for infinite sample size)

__Gamma% Trimmed Mean__

**Procedure**

- __determine__ N, the number of observations
- __compute__ gamma*N (gamma = trim percentage)
- __round__ down to the nearest integer = g
- __remove__ the g smallest and g largest values (the most extreme values)
- __average__ the n-2g values that remain

__Example: 20% Trimmed Mean__**(gamma=.2 or 20%)**
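The five-step procedure above can be sketched directly; the function name and the ten-observation data set are illustrative.

```python
def trimmed_mean(observations, gamma=0.2):
    """Gamma-trimmed mean, following the procedure above:
    determine n, compute gamma*n, round down to the nearest
    integer g, remove the g smallest and g largest values,
    and average the n-2g values that remain."""
    xs = sorted(observations)
    n = len(xs)
    g = int(gamma * n)        # round gamma*n down to the nearest integer
    kept = xs[g:n - g]        # drop g values from each tail
    return sum(kept) / len(kept)
```

For `[1, 2, 3, 4, 5, 6, 7, 8, 9, 100]`, g = 2, the kept values are 3 through 8, and the 20% trimmed mean is 5.5, whereas the untrimmed mean (14.5) is dragged far upward by the single outlier.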

- Variance of the sample trimmed mean refers to the variation among the infinitely many values from repeating a study infinitely many times

- Computing the "variance" from the __data left after trimming__ as if it were a sample mean, and then the __standard error of the mean__ by dividing by the __number of observations__, is **not** satisfactory for obtaining the trimmed variance
  - usual variance of the sample mean: dividing the sum of the variables by **n** divides the variance of the sum by **n**^{2}

**Var(y**_{mean}**) = σ**^{2} **/ n**

- the trimmed mean observations are **not independent** (p162)
  - the trimmed mean is an average of dependent variables, and the sample mean method requires independence
  - the variance of the sum of dependent variables is not equal to the sum of the individual variances, because the ordered variables are not independent
- want to find the situations where the sample trimmed mean has a normal distribution

**Estimating the sample trimmed mean, and variance of the sample trimmed mean**(p162)

__WINSORIZE__

**rewrite the trimmed mean as the average of independent variables**

__COMPUTE__ the **Winsorized Sample Mean (Winsorized Mean)** and the **Sample Variance of the Winsorized mean (Winsorized Sample Variance)**

- put the observations in **order**
- **trim gamma*n = g observations hi/lo**
- replace the trimmed values with the nearest non-trimmed values, so that the __g smallest values are increased to the (g+1)th value__ (not trimmed), and the g largest values are likewise decreased (label these **W**_{1}, ..., **W**_{n})
- **subtract** the Winsorized Mean from each of the Winsorized values, **square** each result, and then **sum**
- **divide by n-1** (number of observations - 1), as when computing the sample variance s^{2}

**Winsorized Sample Mean**

**ybar**_{w} **= (1/n)(W**_{1}**+W**_{2}**+...+W**_{n})

**Winsorized Sample Variance**

**S**_{w}^{2} **= [1/(n-1)][(W**_{1}**-ybar**_{w})^{2} **+ ... + (W**_{n}**-ybar**_{w})^{2}]

__ADJUST__ the area under the curve for trimming, so it sums to a probability of 1

__Estimated Variance of the Trimmed Mean__

**Var(y**_{barT}**) = S**_{w}^{2} **/ [(1-2γ)**^{2}**n]**

__20% Trimmed Mean Estimated Variance__ **(γ=.2, 1-2γ=.6, (1-2γ)**^{2}**=.36)**

**Var(y**_{barT}**) = S**_{w}^{2} **/ (.36n)**

**SD(y**_{barT}**) = S**_{w} **/ [.6**SQRT**(n)]**
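The Winsorizing procedure and the resulting standard error of the trimmed mean can be sketched as follows; the function names are my own, and the data used in the usage note are illustrative.

```python
import math

def winsorized_stats(observations, gamma=0.2):
    """Winsorized sample mean and variance: replace the g smallest values
    with the (g+1)th value and the g largest with the (n-g)th value,
    then compute the ordinary mean and (n-1)-divisor variance."""
    xs = sorted(observations)
    n = len(xs)
    g = int(gamma * n)
    w = [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g   # Winsorized values W_1..W_n
    wmean = sum(w) / n
    wvar = sum((v - wmean) ** 2 for v in w) / (n - 1)     # S_w^2
    return wmean, wvar

def trimmed_mean_se(observations, gamma=0.2):
    """Estimated standard error of the gamma-trimmed mean:
    SE = S_w / [(1 - 2*gamma) * sqrt(n)]."""
    n = len(observations)
    _, wvar = winsorized_stats(observations, gamma)
    return math.sqrt(wvar) / ((1 - 2 * gamma) * math.sqrt(n))
```

For `[1, 2, 3, 4, 5, 6, 7, 8, 9, 100]` the Winsorized values are `[3, 3, 3, 4, 5, 6, 7, 8, 8, 8]`, giving a Winsorized mean of 5.5 and a Winsorized variance of 42.5/9, far smaller than the raw sample variance inflated by the outlier.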

*The Winsorized Variance for the trimmed mean is generally smaller* than the simple sample variance, because Winsorizing pulls in the extreme values that inflate S^{2}

- **the trimmed mean will be more accurate than the sample mean when it has a smaller variance**
- *for a very* __small departure from normality__, the variance of the trimmed mean can be __substantially smaller__
- *the trimmed mean is relatively* __unaffected when sampling from the mixed normal curve__
- *true even when sampling from* __skewed distributions__

- division by (1-2γ)^{2} could sometimes give the trimmed mean a __larger__ variance
  - __typically any such improvement using the mean instead of the trimmed mean is small__

__Standard Error of the Trimmed Mean__ **(estimate of the Population trimmed mean)**

__Small Sample Size (20% trimmed mean)__

**Laplace Method** approximation to the normal curve

**T**_{t} **= (y**_{bart} **- µ**_{t}**) / [S**_{w}**/.6sqrt(n)]**

**CI = (+/-)t**_{crit(df)}**S**_{w} **/ [.6sqrt(n)] = (+/-)t**_{crit(df)}***SE**

- approximates the curve for Student T where **df = n-2g-1** (degrees of freedom)
- for small sample sizes, trimming reduces the effect of non-normality, up to the 20% trimmed mean (less so as trim more)
- improves rapidly as sample size increases

**γ=0.2, g=2, n=12 case**

- df = n-2g-1 = 12-4-1 = 7
- for 20% trim with 7 degrees of freedom at 95% confidence, T_{crit} = 2.365, ie., |T_{t}|>2.365 has probability <5%

**CI = (+/-) 2.365 S**_{w}**/[.6sqrt(n)]**

__20% Trimmed Mean can be substantially more accurate than any method based on means, including the percentile bootstrap method__

*indications are that one can avoid alpha inaccuracies with sample sizes as small as 12 when combining the percentile bootstrap method and the trimmed mean*

- better control over alpha errors
- better power in a wide range of situations
- the problems of bias appear to be negligible (power decreasing as move away from the null hypothesis)
- trimmed mean and sample mean are identical if sampling from a symmetric population distribution
- if sampling from a skewed distribution, the 20% trimmed mean is closer to the most likely values and provides a better reflection of the typical individual in the population
- confidence intervals for the trimmed mean are relatively insensitive to outliers (might be slightly altered when adding the percentile bootstrap technique)
- does not eliminate all problems, however; it reduces them

**Power using Trimmed Mean versus Simple Mean**(p172)

**the confidence interval for the difference between trimmed means is considerably shorter** than the confidence interval for the difference between sample means, because it is less affected by outliers

- *even small departures from normality can result in very low power* (ability to detect an effect)
- the trimmed mean can substantially reduce the problem of bias with a skewed distribution
- it is common to find situations where one fails to find a difference with means, but a difference is found with trimmed means
- it is rare for the reverse to happen, but it does happen

**More issues about Student's T**

- **how accurate the sample mean is for the population mean** (estimates the population __mean__), contrasted to **how well the sample mean describes an individual in the population** (estimates the population __variance__ about the population mean)
- **the probability curve for T may not be symmetric around zero** because of non-normality and asymmetry
  - the 20% trimmed mean with a percentile bootstrap ensures a more equi-tailed test than a percentile T bootstrap with the 20% trimmed mean
- when **comparing multiple groups**, the probability of an alpha error can drop well below the nominal level, and the power can be decreased; switching to a percentile bootstrap method with the 20% trimmed mean can address this
- __Percentile method using the 20% trimmed mean performs about as well as Student's T__
- __Percentile T method__ **works better than the** __percentile method__ **in combination with the 20% trimmed mean as to shorter confidence intervals (Variances)**, and this improves for more than two groups

**Least Squares Regression**

**for the probability of a Type I error** (*alpha error, seeing an effect that isn't there*) when testing the hypothesis of zero slope, **least squares performs reasonably well**

- correlation between observations is a disaster for seeing an effect that isn't there (not discussed here)

- if want to detect an association (small *beta*) and describe that association (small *CI*), least squares can fail miserably
  - __heteroscedasticity__ can result in relatively *low power* when testing the hypothesis that the slope is zero
  - a __single unusual point__ can **mask** an important and interesting association
  - __nonnormality__ makes this **worse**, and the conventional methods for the slope can be highly inaccurate

**when there is nonnormality or heteroscedasticity**, several __additional methods__ compete well, and can be __strikingly more accurate than Least Squares__

- **Theil-Sen Estimator**
- **Least Absolute Value Estimator**
- **Least Trimmed Squares**
- **Least Trimmed Absolute Value**
- **Least Median of Squares**
- **Adjusted M-estimator**
- **newer Empiric methods**
- **Deepest Regression Line (still being developed)**

- Using a regression estimator with a high breakdown point is no guarantee that disaster will be avoided with small sample sizes

- even among the robust regression estimators, the **choice of method can make a practical difference**
  - a lower-than-maximum breakdown point may yield a more accurate estimate of the slope and intercept
  - the example given (p 215) goes from a positive least squares estimate for the slope (1.06, h=0, no trim) to a negative slope (-.62, h=.5n)
  - using h=.8n (breakdown point = .2) gives a slope of 1.37
  - for h=.75n, gives a slope of .83
  - Least median of squares gives 1.36 (breakdown point of 0.5)
  - Theil-Sen gives a slope of 1.01, which is more accurate than the least squares estimate

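Of the robust estimators listed, Theil-Sen is simple enough to sketch directly. The pairwise-slope median is the standard construction; the median-based intercept is one common variant, and the corrupted data set is fabricated for illustration.

```python
import statistics
from itertools import combinations

def theil_sen(points):
    """Theil-Sen slope: the median of the slopes over all pairs of points.
    Intercept taken as median(y) - slope * median(x) (one common variant)."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x2 != x1]
    slope = statistics.median(slopes)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    intercept = statistics.median(ys) - slope * statistics.median(xs)
    return slope, intercept

# ten points on y = 2x + 1, then corrupt one of them badly
pts = [(x, 2 * x + 1) for x in range(10)]
pts[9] = (9, 100)
slope, intercept = theil_sen(pts)
```

The single wild point contributes only 9 of the 45 pairwise slopes, so the median slope stays at 2, where ordinary least squares would be pulled well away.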

__Least Trimmed Squares__ (p 212-215)

__Procedure for Least Trimmed Squares__**:**

__Ignore__ **(trim) the largest residuals when calculating the variance used in the least squares estimate**

- *order the residuals*
- *compute the variance* based on the h smallest residuals (h = çn)
  - with **ç=.5** (h=.5n), the breakdown point is .5, but the *smallest value of h that should be used is n/2 rounded down to the nearest integer, plus 1*
  - **h ≥ Int(n/2) + 1**
- *choose slope b1 and intercept b0* **that minimize this trimmed variance** for the y = b0 + b1x line fit

- in effect, trimming removes the influence of the (n-h) largest residuals

- for h=.75n, ignores .25n in calculating the variance, so 25% of the points can be outliers without destroying the estimator
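The procedure above can be sketched with a brute-force search over candidate lines. Scoring every line through a pair of data points is a heuristic of my own for illustration (production LTS implementations use smarter search); h = Int(n/2) + 1 follows the recommendation in the text.

```python
from itertools import combinations

def lts_objective(points, b0, b1, h):
    """Sum of the h smallest squared residuals for the line y = b0 + b1*x."""
    res = sorted((y - (b0 + b1 * x)) ** 2 for x, y in points)
    return sum(res[:h])

def lts_fit(points, h):
    """Least trimmed squares, approximated by scoring every line through a
    pair of data points and keeping the smallest trimmed objective."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue
        b1 = (y2 - y1) / (x2 - x1)
        b0 = y1 - b1 * x1
        score = lts_objective(points, b0, b1, h)
        if best is None or score < best[0]:
            best = (score, b0, b1)
    return best[1], best[2]

pts = [(x, 2 * x + 1) for x in range(10)]
pts[9] = (9, 100)                       # one gross outlier
b0, b1 = lts_fit(pts, h=len(pts) // 2 + 1)   # h = Int(n/2) + 1, breakdown ~0.5
```

Because the trimmed objective ignores the largest residuals, the line through the nine clean points scores zero and the outlier has no influence on the fit.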

**[ISLT] the problem of highly inaccurate estimates (eg, of the population mean by the sample mean) arises when the trim process changes the sign of the parameter estimate (eg, the slope estimate for Least Squares).**

- may be a general principle that regression estimators break down when they have a sign change from the untrimmed estimate (third derivative of the curve changes at places of zero slope; consider population distribution versus sample distribution)
- this is a principle when linear fitting a sharply curved plot (non-monotonic compared to normal curve)
- need to identify points of zero slope and inflection so that estimates are contained in a place that is monotonic around the actual value, and not skip over a max or min that takes the estimate further away
- graphical or other diagnostic tools appear instrumental to identifying these situations -
__graph the data!__

**Regression Outliers and Leverage Points** (p217)

**Outliers**

- regression outliers are points of the linear pattern that lie relatively __far from the line__ around which most points are centered
- one way to determine them is to use the **least median squares** regression line, compute the residuals, and flag points by some criteria

**Leverage Points**

- unusually **large or small X (predictor) values** are called **leverage points**
- these points can be **heteroscedastic**, with large residual (variance) for the large or small predictors
- a __good leverage point__ is one that has a __large or small predictor value (X)__ but is __not a regression outlier__ (the residual of Y from the fit is small; it __doesn't affect the slope or variance__)
- a __bad leverage point__ is one that has a __large Y residual for large or small predictor values (X)__ and grossly __affects the estimate of the slope__
- the effect of outliers can affect the variance in a different way from the mean
  - **bad** leverage points can __lower the standard error__ despite the large (Y) residuals, since the larger the spread in predictor (X) values, the lower the standard error (SE); but this can be more than offset by the **inaccurate estimate of the slope**
  - **"rogue outliers"** affect the __slope less__ and the __variance more__ by their leverage placement: y and x for the outliers are both inflated or deflated, so that Y/X stays close to the regression estimate although the **variance is inflated**
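The difference between a good and a bad leverage point can be seen directly with ordinary least squares; the data are fabricated for illustration (clean points on y = 2x + 1, plus one extreme-X point).

```python
def ols(points):
    """Ordinary least squares slope and intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b1 = sxy / sxx
    return b1, my - b1 * mx

clean = [(x, 2 * x + 1) for x in range(10)]
slope_clean, _ = ols(clean)

# good leverage point: extreme X but on the line; slope is unaffected
good = clean + [(50, 2 * 50 + 1)]
slope_good, _ = ols(good)

# bad leverage point: extreme X with a huge Y residual; it wrecks the slope
bad = clean + [(50, 0)]
slope_bad, _ = ols(bad)
```

One bad leverage point at x = 50 is enough to drag the estimated slope from 2 all the way below zero, while the good leverage point leaves it at 2.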


If the parent distributions don't really bunch around a central tendency, are heavily skewed (bunched other than at the central tendency) or are heavy-tailed, it is less accurate to use central tendencies to describe and compare them, because the samples taken from those distributions can be biased.

The

"For example, the combination of the Percentile T Bootstrap and the 20 percent trimmed mean ... addresses all the problems with Student's T test, a result supported by both theory and simulation studies. The problem of

The accuracy of probabilities and the significance of differences depend on more than just the value of a central tendency.

TDB