14 Inference for Proportions
14.1 Conditions for Inference
1. Random
The data must come from a random sample or from random assignment in an experiment.
2. Large
Check that the sample is large enough for the sampling distribution of \(\hat p\) to be approximately Normal.
For proportions, remember, the population is never anywhere near Normal (it’s always two bars, yes and no). In this case we check the Large Counts Condition: \(n\hat p \geq 10\) and \(n(1-\hat p) \geq 10\), i.e., at least 10 successes and at least 10 failures in the sample.
This ensures that our sample proportion can take on enough different values (and make enough bars in a histogram) to create an approximately Normal sampling distribution.
3. Sampling Independence
Observations in our sample must be independent of each other.
In random samples from a population, observations are never truly independent, because the population changes with every person we sample and remove from it. However, this effect is small enough to ignore as long as the population from which we’re sampling is at least 10 times as large as our sample (the 10% condition). This needs to be stated! A quick numerical check of conditions 2 and 3 is sketched below.
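Here is a minimal Python sketch of that check (the function `check_conditions` and the example numbers are illustrative, not from any particular library):

```python
def check_conditions(n, p_hat, N):
    """Check the Large Counts Condition and the 10% condition
    for a one-sample proportion procedure.

    n     -- sample size
    p_hat -- sample proportion of successes
    N     -- population size (for the 10% condition)
    """
    successes = n * p_hat         # count of "yes" responses in the sample
    failures = n * (1 - p_hat)    # count of "no" responses in the sample
    large_counts = successes >= 10 and failures >= 10
    ten_percent = n <= 0.10 * N
    return large_counts, ten_percent

# Example: 120 of 400 sampled students say yes; the school has 6000 students
print(check_conditions(n=400, p_hat=120/400, N=6000))  # (True, True)
```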
14.2 One-Sample \(z\)-procedures
14.2.1 One-Sample \(z\)-interval
A \(C\)% confidence interval for the unknown population proportion \(p\) when all conditions are met is calculated with the following:
\[\hat p \pm z^\ast \sqrt{\frac{\hat p (1 - \hat p)}{n}}\]
where \(z^\ast\) is the critical value for the standard Normal curve with \(C\)% of its area between \(-z^\ast\) and \(z^\ast\).
See Appendix C for examples of how to calculate \(z^\ast\) for a given confidence level \(C\)%.
Why?
All confidence intervals are calculated as: \(\text{point estimate } \pm \text{ margin of error}\)
- The point estimate for the true population proportion is \(\hat p\).
- The margin of error is always calculated as \((\text{critical value})(\text{standard error})\).
- We use \(z^\ast\) critical values for \(z\)-intervals (which is why they are called \(z\)-intervals).
- The standard error is calculated as \(\sqrt{\frac{\hat p (1 - \hat p)}{n}}\).
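Putting these pieces together, here is a minimal Python sketch of the one-sample \(z\)-interval, using `scipy.stats.norm.ppf` to find \(z^\ast\) (the function name and the example numbers are made up for illustration):

```python
from math import sqrt
from scipy.stats import norm

def one_sample_z_interval(x, n, conf=0.95):
    """C% z-interval for a population proportion p.

    x    -- count of successes in the sample
    n    -- sample size
    conf -- confidence level C as a decimal
    """
    p_hat = x / n
    # z* puts C% of the standard Normal area between -z* and z*
    z_star = norm.ppf(1 - (1 - conf) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)   # standard error of p-hat
    moe = z_star * se                    # margin of error
    return p_hat - moe, p_hat + moe

# Example: 210 successes in a sample of 500, 95% confidence
print(one_sample_z_interval(210, 500))  # roughly (0.377, 0.463)
```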
14.2.2 One-Sample \(z\)-test
Suppose we are given the null hypothesis \(H_0: p=p_0\) and an observed sample proportion \(\hat p\) from a sample of size \(n\), with all conditions met. The null sampling distribution of \(\hat p\), calculated assuming \(H_0\) is true, has the following:
\[\mu_{\hat p} = p_0\] \[\sigma_{\hat p} = \frac{\sqrt{p_0(1-p_0)}}{\sqrt n} = \sqrt{\frac{p_0(1- p_0)}{n}}\]
The z-statistic is the standardized score of \(\hat p\) under the mean and standard deviation of the null distribution:
\[z=\frac{\hat p - p_0}{\sqrt{\frac{p_0(1- p_0)}{n}}}\]
The p-value, calculated from the z-statistic, depends on the direction of the alternative hypothesis, \(H_a\). Here \(Z\) denotes a standard Normal random variable.
\[ \text{p-value}= \begin{cases} P(Z<z) & \text{ if } H_a:p<p_0 \\ P(Z>z) & \text{ if } H_a:p>p_0 \\ 2\cdot P(Z<-|z|) & \text{ if } H_a:p\not =p_0 \end{cases} \]
When checking whether the sample is large enough, make sure to adjust the Large Counts Condition accordingly. We are assuming the null hypothesis is true, so check that
\(n p_0\) and \(n(1- p_0)\) are both at least 10.
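Here is a minimal Python sketch of the full test (the function name and its `alternative` argument are illustrative; `scipy.stats.norm` supplies the standard Normal CDF):

```python
from math import sqrt
from scipy.stats import norm

def one_sample_z_test(x, n, p0, alternative="two-sided"):
    """One-sample z-test for H0: p = p0.

    x           -- count of successes
    n           -- sample size
    p0          -- null value of the proportion
    alternative -- "less", "greater", or "two-sided"
    """
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # standardized score of p-hat
    if alternative == "less":        # Ha: p < p0
        p_value = norm.cdf(z)
    elif alternative == "greater":   # Ha: p > p0
        p_value = norm.sf(z)         # upper-tail area, 1 - cdf(z)
    else:                            # Ha: p != p0
        p_value = 2 * norm.cdf(-abs(z))
    return z, p_value

# Example: 58 successes in 100 trials against H0: p = 0.5, Ha: p > 0.5
print(one_sample_z_test(58, 100, 0.5, alternative="greater"))  # z = 1.6, p ≈ 0.055
```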
14.3 Two-Sample \(z\)-procedures
Two-Sample z-procedures are justified by knowledge about the sampling distribution of \(\hat p_1 - \hat p_2\).
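For reference (a standard result, assuming two independent random samples and the conditions below), that sampling distribution is approximately Normal with:
\[\mu_{\hat p_1-\hat p_2}=p_1-p_2 \qquad \sigma_{\hat p_1-\hat p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\]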
The conditions for two-sample \(z\)-procedures are the same for both of the procedures we learn (the interval and the test):
Random: The data come from two independent random samples or from two groups in a randomized experiment.
Large Counts: The counts of “successes” and “failures” in each sample or group, \(n_1\hat p_1\), \(n_1(1-\hat p_1)\), \(n_2\hat p_2\), and \(n_2(1-\hat p_2)\), are all at least 10.
- If you are working with a \(z\)-test, instead check the counts using the pooled proportion \(\hat p_c\) defined in Section 14.3.2: \(n_1\hat p_c\), \(n_1(1-\hat p_c)\), \(n_2\hat p_c\), and \(n_2(1-\hat p_c)\) must all be at least 10.
10%: When sampling without replacement, check that \(n_1\leq(.10)N_1\) and \(n_2\leq(.10)N_2\).
14.3.1 Two-Sample \(z\)-interval
When the conditions are met, an approximate \(C\)% confidence interval for \(p_1-p_2\) is:
\[(\hat p_1-\hat p_2)\pm z^\ast\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1}+\frac{\hat p_2(1-\hat p_2)}{n_2}}\]
where \(z^\ast\) is the critical value for the standard Normal curve with \(C\)% of its area between \(-z^\ast\) and \(z^\ast\).
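A minimal Python sketch of this interval, again using `scipy.stats.norm.ppf` for \(z^\ast\) (names and numbers are illustrative):

```python
from math import sqrt
from scipy.stats import norm

def two_sample_z_interval(x1, n1, x2, n2, conf=0.95):
    """C% z-interval for p1 - p2 from two independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    z_star = norm.ppf(1 - (1 - conf) / 2)
    # standard error of the difference in sample proportions
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se

# Example: 90/200 successes vs. 66/200 successes, 95% confidence
print(two_sample_z_interval(90, 200, 66, 200))  # roughly (0.025, 0.215)
```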
14.3.2 Two-Sample \(z\)-test
The null hypothesis has the general form \[ H_0:p_1-p_2=p_0 \] where \(p_0\) is our hypothesized difference. We’ll restrict ourselves to situations in which \(p_0 =0\). Then the null hypothesis says that there is no difference between the two parameters: \[H_0:p_1-p_2=0~~\text{or}~~H_0:p_1=p_2\] The alternative hypothesis says what kind of difference we expect: \[H_a:p_1-p_2>0~~\text{or}~~H_a:p_1-p_2<0~~\text{or}~~H_a:p_1-p_2\not =0 \]
Suppose the conditions are met. To test the hypothesis \(H_0: p_1 - p_2 = 0\), first find the pooled proportion \(\hat p_c\) of successes in both samples combined:
\[\hat p_c = \frac{X_1 + X_2}{n_1 + n_2}\]
where \(X_1\) and \(X_2\) are the counts of successes in the two samples.
Then compute the z statistic: \[ z = \frac{(\hat p_1 - \hat p_2) - 0}{\sqrt{\frac{\hat p_c(1-\hat p_c)}{n_1} +\frac{\hat p_c(1-\hat p_c)}{n_2}}}\]
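The p-value is then found from the standard Normal distribution exactly as in the one-sample test. Here is a minimal Python sketch tying the pooled proportion and the z statistic together (the function name is illustrative):

```python
from math import sqrt
from scipy.stats import norm

def two_sample_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Two-sample z-test for H0: p1 - p2 = 0 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)   # pooled proportion of successes
    se = sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    if alternative == "less":        # Ha: p1 - p2 < 0
        p_value = norm.cdf(z)
    elif alternative == "greater":   # Ha: p1 - p2 > 0
        p_value = norm.sf(z)
    else:                            # Ha: p1 - p2 != 0
        p_value = 2 * norm.cdf(-abs(z))
    return z, p_value

# Example: 90/200 vs. 66/200, two-sided alternative
print(two_sample_z_test(90, 200, 66, 200))  # z ≈ 2.46, p ≈ 0.014
```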