A **confidence interval** is a range of values, constructed from sample data, that is likely to include the population parameter with a stated level of confidence.

The objective of a confidence interval is to convey both the location and the precision of the estimate of a population parameter.

A confidence interval for the population mean may be stated as 30 \le \mu \le 50, which means that the population mean is estimated to lie between 30 and 50.

Since the interval estimate may or may not contain the true parameter value, we associate a confidence (probability) with finding the true parameter value in the interval.

We may say that we are 95% confident that the interval contains the population mean, which also implies that there is a 5% chance that the interval does not contain the population mean.

The confidence level of an interval estimate of a population parameter is usually written as (1-\alpha)100%. It is the probability that the interval estimate will contain the true population parameter, in the sense that about (1-\alpha)100% of the intervals built this way from repeated samples would contain it.

When \alpha = 0.05, the confidence level is 95% and 0.95 is the probability that the interval estimate will contain the population parameter.

The value of \alpha is called the significance level. It is the chance that the interval estimate does not contain the true population mean.
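One way to see what a 95% confidence level means is to simulate many samples and count how often the resulting intervals cover the true mean. The data step below is a minimal sketch rather than a definitive implementation: the population values (mu = 40, sigma = 10), the sample size of 25, the 1,000 repetitions and the seed are all assumed for illustration, and the interval uses the known-sigma margin of error described in the next section.

```sas
data coverage;
    call streaminit(12345);           /* assumed seed, only for repeatability */
    mu = 40; sigma = 10;              /* assumed "true" population values */
    n = 25; alpha = 0.05;             /* assumed sample size, 95% level */
    z = probit(1 - alpha/2);          /* critical value, about 1.96 */
    e = z * sigma / sqrt(n);          /* margin of error with known sigma */
    do sample = 1 to 1000;
        total = 0;
        do i = 1 to n;
            total = total + rand("normal", mu, sigma);
        end;
        xbar = total / n;                         /* sample mean */
        covered = (xbar - e <= mu <= xbar + e);   /* 1 if the interval contains mu */
        output;
    end;
    keep sample xbar covered;
run;

/* The mean of COVERED is the share of intervals that contain mu */
proc means data=coverage mean maxdec=3;
    var covered;
run;
```

The reported mean of `covered` should be close to 0.95, although it will vary a little with the seed and the number of repetitions.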

## Confidence Interval for Population Mean when Standard Deviation is known

The confidence interval for a population mean is determined by taking the sample mean (the point estimate) and adding and subtracting a margin of error.

\overline{X} \pm E

If the population standard deviation is known, the margin of error is given by

E = Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}

where \alpha \text{ (significance level)} = 1 - CL\text{ (Confidence level)} and correspondingly, CL = 1 - \alpha

So, if CL = 95%, then \alpha = 1 - 0.95 = 0.05.

Z_{\alpha/2} is called the critical value which can be found in the Z table.

A Z value (Z score) tells us how many standard deviations an observation is from the mean. A Z score of -2 tells us that the observation is 2 standard deviations to the left of the mean.

More specifically, each Z score is associated with an area under the standard normal curve, and we can find that area using a **Z table**, also known as a **Standard Normal Table**.

The table tells us the total area to the left of any value of Z.

The top row and the first column together give the Z value, and the numbers in the body of the table are the corresponding areas.

Now, let’s find the Z value for a 95% confidence interval.

We know that \alpha = 0.05 for a 95% confidence interval. The total area under the curve is 1. Since 95%, or 0.95, is the area in the middle and the leftover area is \alpha, we divide \alpha into two equal parts, which gives an area of 0.025 in the left tail and 0.025 in the right tail.

So, the area to the left of the positive critical value is 0.95 + 0.025 = 0.975. We find the positive Z value by locating the area closest to 0.975 in the Z table, which corresponds to Z = 1.96.

This Z value tells us that 95% of the area lies within roughly 1.96 standard deviations of the mean.

Since the normal distribution is symmetric, the corresponding value on the left side of the curve is -1.96.
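The table lookup can also be reproduced in a data step. The sketch below uses the SAS functions PROBIT (inverse of the standard normal cumulative distribution) and CDF to find the critical values for a few confidence levels; the levels 90%, 95% and 99% and the variable names are chosen for illustration only.

```sas
data zcrit;
    do cl = 0.90, 0.95, 0.99;            /* assumed confidence levels */
        alpha = 1 - cl;                  /* significance level */
        z_upper = probit(1 - alpha/2);   /* positive critical value */
        z_lower = probit(alpha/2);       /* negative critical value, by symmetry */
        area_left = cdf("normal", z_upper); /* area to the left of z_upper */
        output;
    end;
run;

proc print data=zcrit noobs;
run;
```

For a confidence level of 0.95, `z_upper` comes out at about 1.96 and `area_left` at 0.975, matching the Z table lookup.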

We can write the 95% confidence interval for the population mean, when the population standard deviation is known, as:

\overline{X} \pm 1.96 \frac{\sigma}{\sqrt{n}}
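To get a feel for how the margin of error in this formula depends on the sample size, the sketch below evaluates 1.96 \sigma/\sqrt{n} for a few values of n. The standard deviation of 1.2 and the sample sizes are assumed purely for illustration.

```sas
data margin;
    sigma = 1.2;                    /* assumed known population standard deviation */
    alpha = 0.05;                   /* 95% confidence level */
    z = probit(1 - alpha/2);        /* critical value, about 1.96 */
    do n = 25, 100, 400;            /* assumed sample sizes */
        e = z * sigma / sqrt(n);    /* margin of error */
        output;
    end;
run;

proc print data=margin noobs;
    var n e;
run;
```

Because n enters through its square root, quadrupling the sample size halves the margin of error.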

**Example:**

A sample of 100 subjects was chosen to estimate the length of stay at a hospital. The sample mean was 4.5 days and the population standard deviation was known to be 1.2 days.

- Calculate the 95% confidence interval for the population mean.
- What is the probability that the population mean is greater than 4.73 days?

**Solution:**

(1) The known values are:

n = 100, \overline{X} = 4.5 \text{ days}, \sigma = 1.2 \text{ days}

The interval estimate of the mean is calculated as \overline{X} \pm E, where the margin of error is E = Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}. Thus the formula can be written as:

\overline{X} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}

So, the standard error is \sigma / \sqrt{n} = 1.2 / \sqrt{100} = 0.12

The 95% confidence interval is given by:

(4.5 - 1.96 \times 0.12, \; 4.5 + 1.96 \times 0.12) = (4.2648, 4.7352)

where \pm 1.96 are the critical values obtained from the Z table for a 95% confidence interval, corresponding to cumulative areas of 0.025 and 0.975 (\alpha/2 = 0.025 in each tail).

Thus, we can say that we are 95% confident that the population mean is between 4.2648 and 4.7352 days.

4.2648 \le \mu \le 4.7352

(2) Since the upper limit of the 95% confidence interval is 4.7352, which is approximately 4.73, we can say that the probability that the population mean is greater than 4.73 days is approximately 0.025.
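The figure of 0.025 can be checked against the normal curve that the interval is built from. The sketch below follows the reasoning above and treats the sample mean of 4.5 and the standard error of 0.12 as the centre and spread of that curve (an assumption made for illustration), then computes the area to the right of 4.73. The variable names are hypothetical.

```sas
data tail;
    sample_mean = 4.5;                    /* sample mean from the example */
    std_err = 1.2 / sqrt(100);            /* standard error, 0.12 */
    z = (4.73 - sample_mean) / std_err;   /* distance of 4.73 in standard errors */
    p_right = 1 - cdf("normal", z);       /* area to the right of 4.73 */
run;

proc print data=tail noobs;
run;
```

Because 4.73 is slightly below the upper limit of 4.7352, `p_right` comes out slightly above 0.025.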

### Calculating the Confidence Interval in a SAS data step

The Confidence Interval can be calculated in a SAS data step as below.

```
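data CI;
    N = 100;                 /* sample size */
    SAMPLE_MEAN = 4.5;       /* sample mean in days */
    STD_DEV = 1.2;           /* known population standard deviation */
    ALPHA = 0.05;            /* significance level for a 95% interval */
    Z = probit(1 - ALPHA/2); /* critical value, about 1.96 */
    LCLM = SAMPLE_MEAN - Z*STD_DEV/sqrt(N); /* lower confidence limit */
    HCLM = SAMPLE_MEAN + Z*STD_DEV/sqrt(N); /* upper confidence limit */
run;

proc print data=CI;
run;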
```

## Confidence Interval for Population Mean when Standard Deviation is unknown

When the population standard deviation is unknown, we cannot use the formula below.

\overline{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}

William Gosset proved that if the population follows a normal distribution and the standard deviation is calculated from the sample, then the statistic given below follows a t-distribution with (n-1) degrees of freedom.

t = \frac{\overline{X} - \mu}{S/\sqrt{n}}

S is the standard deviation estimated from the sample. The t-distribution is very similar to the standard normal distribution: it has a bell shape, and its mean, median and mode are all equal to 0.

The major difference between the t-distribution and the standard normal distribution is that the t-distribution has heavier tails. However, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.
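This convergence can be seen by comparing critical values. The sketch below uses the SAS functions TINV and PROBIT to list the 95% critical value of the t-distribution next to the corresponding Z value; the degrees of freedom shown are chosen for illustration only.

```sas
data tcrit;
    alpha = 0.05;                        /* 95% confidence level */
    z = probit(1 - alpha/2);             /* standard normal critical value */
    do df = 5, 10, 30, 69, 1000;         /* assumed degrees of freedom */
        t = tinv(1 - alpha/2, df);       /* t critical value for df */
        diff = t - z;                    /* extra width due to heavier tails */
        output;
    end;
run;

proc print data=tcrit noobs;
    var df t z diff;
run;
```

The difference `diff` is positive for every row and shrinks toward zero as the degrees of freedom grow, which is why t-based intervals are wider than Z-based ones for small samples.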

The (1-\alpha)100% confidence interval for the mean of a population that follows a normal distribution, when the standard deviation is unknown, is given by

\overline{X} \pm t_{\alpha/2,n-1} \times \frac{S}{\sqrt{n}}

**Example:**

An online grocery store is interested in estimating the basket size of its customer orders so that it can optimize the size of crates used for delivering the grocery items. For a sample of 70 customers, the mean basket size was estimated as 24 and the standard deviation estimated from the sample was 3.8. Calculate the 95% confidence interval for the basket size of the customer order.

**Solution:**

n = 70, \overline{X} = 24, S = 3.8, and the degrees of freedom are (n-1) = 69.

The t value can be found using the t table or using the **TINV** function in SAS.

Using the t table, look at the intersection of the row for the degrees of freedom and the column for the required confidence level.

Since 69 degrees of freedom is not listed in the table, we use the closest listed value, which is 60; the corresponding t value is 2.000. (The exact value for 69 degrees of freedom is about 1.995, so the approximation changes the result very little.)

The confidence interval for the basket size is given by

\overline{X} \pm t_{\alpha/2,n-1} \frac{S}{\sqrt{n}}

The lower confidence limit is given by 24 - 2 \times \frac{3.8}{\sqrt{70}} = 23.09

The upper confidence limit is given by 24 + 2 \times \frac{3.8}{\sqrt{70}} = 24.91

Thus, the 95% confidence interval for the basket size is (23.09, 24.91).

### Calculating Confidence Interval in SAS

In SAS, we can use **PROC MEANS** with the **CLM** option to find the lower and upper confidence limits.

I have simulated the above example using random numbers and calculated the lower and upper confidence limits as below. Because the data are random, the output will not match the worked example exactly.

```
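data basket;
    /* simulate 70 basket sizes, uniform between 20 and 31 */
    do i = 1 to 70;
        size = round(20 + floor(1+30-20)*rand("uniform"), .01);
        output;
    end;
    drop i;
run;

/* CLM requests confidence limits for the mean; ALPHA=0.05 gives 95% */
proc means data=basket alpha=0.05 clm mean std maxdec=3;
    var size;
run;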
```

If you don’t have the raw data, you can also use the data step below to calculate the confidence limits.

```
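data CI;
    X = 24;        /* sample mean */
    S = 3.8;       /* sample standard deviation */
    N = 70;        /* sample size */
    ALPHA = 0.05;  /* significance level for a 95% interval */
    CRITICAL_VALUE = tinv(1 - ALPHA/2, N - 1); /* t critical value, N-1 df */
    LCLM = X - (CRITICAL_VALUE * S/sqrt(N));   /* lower confidence limit */
    HCLM = X + (CRITICAL_VALUE * S/sqrt(N));   /* upper confidence limit */
run;

proc print data=CI;
run;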
```