2022 Fall - MAT 152 - Unit 2
Statistical Methods I (MAT 152)
Central Piedmont Community College
Sec 4 – Scatter Diagrams and Correlation
A) Scatter Diagram – a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the diagram. The explanatory variable (the 𝑥-value) is plotted on the horizontal axis, and the response variable (the 𝑦-value) is plotted on the vertical axis. The response variable is the variable whose value can be explained by the value of the explanatory, or predictor, variable.
*Ex. The data shown below are based on a study for drilling rocks in Knowhere’s mines. The Collector wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins.
Depth at which Drilling Begins
35 50 75 95 120 130 145 155 160 175 185 190
Time to Drill 5 Feet
5 5 6 6 7 6 6 7 7 7 6 7.
(1) Decide which is the explanatory variable and the response variable.
Depth is explanatory and Time is response
(2) Draw a scatter diagram of the data (points already plotted).
There are various types of relations in a scatter diagram.
B) Linear Correlation Coefficient – a measure of the strength and direction of the linear relation between two quantitative variables.
Positively Associated - for two variables, whenever the value of one variable increases, the value of the other variable also increases.
Negatively Associated – for two variables, whenever the value of one variable increases, the value of the other variable decreases.
The population correlation coefficient is represented by the Greek letter 𝝆, pronounced rho. The sample correlation coefficient is represented by the letter 𝒓. The formula for the sample correlation coefficient is
$$r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1}$$
where 𝑥̅ and 𝑠𝑥 are the sample mean and standard deviation of the explanatory variable, and 𝑦̅ and 𝑠𝑦 are the sample mean and standard deviation of the response variable.
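The formula above can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function name `sample_correlation` and the five data pairs are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r, computed term by term from the formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations use n - 1 in the denominator.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of the products of the standardized x and y values.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data (not from the drilling example).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = sample_correlation(xs, ys)
print(round(r, 4))
```

A positive result indicates a positive association and a negative result a negative association; values closer to 1 or -1 indicate a stronger linear relation.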
C) Testing for Linear Relation
Determine the absolute value of the correlation coefficient.
Find the critical value in table II for the given sample size.
Table II: Critical Values for Correlation Coefficient
n    Critical Value    n    Critical Value
3    0                 15   0.
4    0                 16   0.
5    0                 17   0.
6    0                 18   0.
7    0                 19   0.
8    0                 20   0.
9    0                 21   0.
10   0                 22   0.
11   0                 23   0.
12   0                 24   0.
13   0                 25   0.
14   0                 26   0.
- If the absolute value of the correlation coefficient is greater than the critical value, we say a linear relation exists between the two variables. Otherwise, no linear relation exists.
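The decision rule can be written as a one-line comparison. In this sketch the function name `linear_relation_exists` is hypothetical, and the critical values in the usage lines are placeholders only; in practice the critical value must be read from Table II for the actual sample size.

```python
def linear_relation_exists(r, critical_value):
    """Return True when |r| exceeds the critical value for the sample size."""
    return abs(r) > critical_value

# Placeholder inputs for illustration only; real critical values come from Table II.
print(linear_relation_exists(-0.8, 0.6))  # |r| = 0.8 is greater than 0.6
print(linear_relation_exists(0.3, 0.6))   # |r| = 0.3 is not greater than 0.6
```

Taking the absolute value first means the same rule covers both positive and negative associations.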
*Ex. Determine whether a linear relation exists between time to drill five feet and depth at which drilling begins. Comment on the type of relation that appears to exist between the two.
0>0.
D) Causation vs Correlation – When data is observational, we cannot claim a causal relation exists between two variables. We can only claim causality when the data are collected through a designed experiment.
*Ex. According to data obtained from JARVIS, the correlation between the percentage of female population with a bachelor’s degree in New York City and the percentage of births to unmarried mothers in Wakanda since 1990 is 0. Does this mean that a higher percentage of females with bachelor’s degrees causes a higher percentage of births to unmarried mothers?
No. The correlation exists only because both percentages have increased since 1990.
Another way two variables can be related even though there is not a causal relation is through a lurking variable – related to both the explanatory and response variable. For example, Bucky Barnes noticed ice cream sales and crime rates have a very high correlation. Does this mean that he should report this to the local governments and shut down all ice cream shops? No, the lurking variable is temperature. As air temperature rises, both ice cream sales and crime rates rise.
*Ex. Because colas tend to replace healthier beverages and contain caffeine and phosphoric acid, researchers Drax and Gamora wanted to know whether cola consumption is associated with lower bone mineral density in women. The table lists the typical number of cans of cola consumed in a week and the femoral neck bone mineral density for a sample of 15 women. Data was collected through surveys and questionnaires.
Number of Colas Per Week    Bone Mineral Density (g/cm²)
0    0.
0    0.
1    0.
1    0.
2    0.
2    0.
3    0.
3    0.
4    0.
5    0.
5    0.
6    0.
7    0.
7    0.
8    0.
Sec 4 – Least Squares Regression
A) Least Squares Regression Line – The difference between the observed value of 𝑦 and the predicted value of 𝑦 is called the error, or residual.
*Ex. Use the following sample data:
𝑥    0    2    3    5    6    6
𝑦    5    5    5    2    1    2.
(1) Find a linear equation that relates 𝑥 and 𝑦 by using (2, 5) and (6, 1).
𝑦 = −0 + 7.
(2) Use the equation to predict 𝑦 if 𝑥 = 3.
(3) Determine the residual for 𝑥 = 3.
0.
The least squares regression line is the line that minimizes the sum of the squared errors, or residuals. That is, it minimizes the sum of the squared vertical distances between the observed values of 𝑦 and the values 𝑦̂ predicted by the line. The equation is given by
$$\hat{y} = b_1 x + b_0$$
where $b_1 = r \cdot \frac{s_y}{s_x}$ is the slope and $b_0 = \bar{y} - b_1 \bar{x}$ is the 𝑦-intercept.
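The slope and intercept formulas can be sketched in Python as below. The function name `least_squares_line` and the data are hypothetical and are not taken from the drilling example; the sketch also shows a residual as observed minus predicted.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b1, b0) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b1, b0

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, b0 = least_squares_line(xs, ys)

predicted = b1 * 3 + b0    # predicted y when x = 3
residual = 5 - predicted   # observed y at x = 3 is 5
print(round(b1, 4), round(b0, 4), round(residual, 4))
```

A positive residual means the observed value lies above the line, and a negative residual means it lies below.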
*Ex. The data shown below are based on a study for drilling rocks in Knowhere’s mines. The Collector wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. Recall 𝑟 = 0.
Depth at which Drilling Begins
35 50 75 95 120 130 145 155 160 175 185 190
Time to Drill 5 Feet
5 5 6 6 7 6 6 7 7 7 6 7.
(1) Find the least squares regression line. Round to four decimal places as needed.
Y=0+5.
X=126 and sx = 52. Y=6 and sy=0.
(2) Predict the drilling time if drilling starts at 130 feet.
7.
(3) Is the observed drilling time at 130 feet above, below, or average?
The observed time is 6 and the predicted time is 7. The observed time is below average for that depth.
Sec 4 – Diagnostics on the Least Squares Regression Line
A) Coefficient of Determination, 𝑅² – measures the proportion of total variation in the response variable that is explained by the least squares regression line. The coefficient of determination is a number between 0 and 1, inclusive. That is, 0 ≤ 𝑅² ≤ 1. If 𝑅² = 0, the line has no explanatory value. If 𝑅² = 1, the line explains 100% of the variation in the response variable.
To determine 𝑅² for the linear regression model, simply square the value of the linear correlation coefficient.
Note: Squaring the linear correlation coefficient to obtain the coefficient of determination works only for the least squares linear regression model. The method does not work in general.
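As a check on the shortcut, the sketch below computes the coefficient of determination two ways for the least squares line: by squaring r, and as 1 minus the ratio of unexplained variation to total variation. The second expression is the standard "proportion explained" definition and is an assumption here, since the notes do not write it out; the data and the function name `r_squared_two_ways` are hypothetical.

```python
import statistics

def r_squared_two_ways(xs, ys):
    """Return (r squared, 1 - SSE/SST) for the least squares line through the data."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    # Unexplained variation: sum of squared residuals about the line.
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
    # Total variation: sum of squared deviations about the mean of y.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return r ** 2, 1 - sse / sst

# Hypothetical data for illustration.
squared_r, explained = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(squared_r, 4), round(explained, 4))
```

The two values agree only because the line used is the least squares line, which is the point of the note above.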
*Ex. The data shown below are based on a study for drilling rocks in Knowhere’s mines. The Collector wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins.
Depth at which Drilling Begins
35 50 75 95 120 130 145 155 160 175 185 190
Time to Drill 5 Feet
5 5 6 6 7 6 6 7 7 7 6 7.
Below are some statistics found in the previous two sections.
Statistic      Mean    Standard Deviation
Depth          126     52.
Time           6       0.
Correlation 0.
Regression Equation 𝑇𝑖𝑚𝑒 = 0 ∗ 𝐷𝑒𝑝𝑡ℎ + 5.
(1) Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be the best guess?
The mean time to drill 5 feet is 6 minutes
(2) Suppose we are asked to predict the time to drill an additional 5 feet if the current depth of the drill is 160 feet.
7.
(3) Find and interpret the coefficient of determination for the drilling data.
R^2=0 so 59% of the variability is explained by the least squares regression line.
B) Residuals – recall residuals are the difference between the observed value and the predicted value. They play an important role in determining the adequacy of the linear model. In fact, residuals can be used for the following:
To determine whether a linear model is appropriate to describe the relation between the predictor and response variable
To determine whether the variance of the residuals is constant
To check for outliers. A plot of residuals against the explanatory variable may also reveal outliers. These values will be easy to identify because the residual will lie far from the rest of the plot.
C) Influential Observation – an observation that significantly affects the least squares regression line’s slope and 𝑦-intercept or the value of the correlation coefficient. It typically exists when the point is an outlier relative to the values of the explanatory variable. So case 3 is likely influential.
Sec 5 – Probability Rules
A) Probability – a measure of the likelihood of a random phenomenon or chance behavior occurring. Probability describes the long-term proportion with which a certain outcome will occur in situations with short-term uncertainty.
In probability, an experiment is any process that can be repeated in which the results are uncertain. The sample space, 𝑺, of a probability experiment is the collection of all possible outcomes. An event is any collection of outcomes from a probability experiment. It may consist of one outcome or more than one outcome. We will denote events with one outcome, sometimes called simple events, 𝒆𝒊. In general, events are denoted using capital letters such as 𝑬.
*Ex. Consider the probability of Tony Stark and Pepper Potts having two children.
(1) Identify the outcomes of the probability experiment.
e=gb e=bg e=gg e=bb
(2) Determine the sample space.
S={gg,bb,gb,bg}
(3) Define the event 𝐸 = ℎ𝑎𝑣𝑒 𝑜𝑛𝑒 𝑏𝑜𝑦
E={gb,bg}
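The sample space and event in this example can also be enumerated mechanically. This is a small sketch using Python's standard `itertools.product`; the variable names are illustrative, and "b" and "g" stand for boy and girl as in the example.

```python
from itertools import product

# Every ordered pair of children: first child, then second child.
sample_space = ["".join(pair) for pair in product("bg", repeat=2)]

# Event E: exactly one of the two children is a boy.
event_one_boy = [outcome for outcome in sample_space if outcome.count("b") == 1]

print(sample_space)
print(event_one_boy)
```

Listing ordered pairs keeps "gb" and "bg" as distinct outcomes, which matches the sample space written above.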
B) Rules of Probabilities
Rule 1: The probability of any event 𝐸, 𝑃(𝐸), must be greater than or equal to 0 and less than or equal to 1. That is, 0 ≤ 𝑃(𝐸) ≤ 1.
Rule 2: The sum of the probabilities of all outcomes must equal 1. That is, if the sample space 𝑆 = {𝑒₁, 𝑒₂, 𝑒₃, ..., 𝑒ₙ}, then 𝑃(𝑒₁) + 𝑃(𝑒₂) + ⋯ + 𝑃(𝑒ₙ) = 1.
A probability model lists the possible outcomes of a probability experiment and each outcome’s probability. It must satisfy rules 1 and 2 as well.
*Ex. In Peter Quill’s bag of peanut M&M milk chocolate candy, the colors of the candies can be brown, yellow, red, blue, orange, or green. Suppose that a candy is randomly selected from a bag. The table shows each color and the probability of drawing that color. Verify this is a probability model.
Color Brown Yellow Red Blue Orange Green Probability 0 0 0 0 0 0.
Yes. Each probability is between 0 and 1, and the probabilities sum to 1.
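Verifying a probability model amounts to checking the two rules. A minimal sketch follows, assuming the model is stored as a dictionary from outcome to probability; the function name `is_probability_model` and the example probabilities are hypothetical and are not the candy values from the table.

```python
import math

def is_probability_model(model):
    """Check that each probability is in [0, 1] and that they sum to 1."""
    each_in_range = all(0 <= p <= 1 for p in model.values())
    # Compare with a small tolerance because decimal values are stored as floats.
    sums_to_one = math.isclose(sum(model.values()), 1.0, abs_tol=1e-9)
    return each_in_range and sums_to_one

# Hypothetical models for illustration.
print(is_probability_model({"A": 0.2, "B": 0.5, "C": 0.3}))
print(is_probability_model({"A": 0.6, "B": 0.6}))
```

The tolerance matters in practice: rounded probabilities from a table may sum to 0.999 or 1.001, so a looser tolerance may be appropriate for rounded data.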
If an event is impossible, the probability of the event is 0. If the event is a certainty, the probability of the event is 1. An unusual event is an event that has a low probability of occurring (usually less than 0.05, or 5%).
C) Empirical Method – the probability of an event 𝐸 is approximately the number of times event 𝐸 is observed divided by the number of repetitions of the experiment
$$P(E) \approx \frac{\text{frequency of } E}{\text{number of trials of experiment}}$$
*Ex. Stan Lee plays Roll the Dice, a game where you roll a die. Points are earned based on the way the die lands. There are six possible outcomes. He rolled the die 3,939 times (wow...). The number of times each outcome occurred is recorded below.
Outcome      1      2      3     4     5     6
Frequency    1344   1294   767   365   137   32
(1) Use the results of the experiment to build a probability model for the die results. Round to three decimal places as needed.
Outcome 1 2 3 4 5 6 Probability 0 0 0 0 0 0.
(2) Estimate the probability that a thrown die lands on a 2.
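The empirical method for this example can be sketched as below, using the frequencies from the table. Only the variable names are assumptions; each probability is the outcome's frequency divided by the total number of rolls, rounded to three decimal places.

```python
# Observed frequencies for each outcome, taken from the table.
frequencies = {1: 1344, 2: 1294, 3: 767, 4: 365, 5: 137, 6: 32}

total_rolls = sum(frequencies.values())

# Empirical probability = frequency of outcome / number of trials.
model = {outcome: round(count / total_rolls, 3) for outcome, count in frequencies.items()}

print(total_rolls)
print(model)
print(model[2])  # estimated probability of landing on a 2
```

Because each entry is rounded separately, the rounded probabilities may not sum to exactly 1.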
E) Subjective Probability – a probability obtained on the basis of personal judgment. For example, a scientist like Dexter predicting there is a 20% chance of drought next year would be a subjective probability.
*Ex. Determine whether the following are empirical, classical, or subjective probability.
(1) In his fall article, Rocket Raccoon investigated the probabilities that a particular outcome will be rolled in a game of craps. He reports that these probabilities are based on the amount of money bet on each outcome.
(2) In his fall article, Groot investigated the probabilities that a particular outcome will be rolled in a game of craps. He reports that these probabilities are based on the previous one hundred rolls.
(3) In his fall article, Yondu investigated the probabilities that a particular outcome will be rolled in a game of craps. He reports that these probabilities are based on chances alone.
Sec 5 – Addition Rule and Complements
A) Disjoint Events – two events are disjoint or mutually exclusive if they have no outcomes in common. We often draw Venn diagrams to illustrate these events. The rectangle represents the sample space and each circle an event.
*Ex. Suppose Drax randomly selects a chip from a bag where each chip in the bag is labeled from 0 to 9. Let 𝐸 represent the event “choose a number less than or equal to 2” and 𝐹 the event “choose a number greater than or equal to 8”. These events are disjoint as shown below.
Determine the following probability 𝑃(𝐸), 𝑃(𝐹), 𝑃(𝐸 𝑜𝑟 𝐹), 𝑃(𝐸 𝑎𝑛𝑑 𝐹).
P(e)=0. P(f)=0. P(e or f)=0. P(e and f) = 0
Addition Rule for Disjoint Events – if 𝐸 and 𝐹 are disjoint events, then
𝑃(𝐸 𝑜𝑟 𝐹) = 𝑃(𝐸) + 𝑃(𝐹)
This can be extended to more than two disjoint events.
𝑃(𝐸 𝑜𝑟 𝐹 𝑜𝑟 𝐺 𝑜𝑟 ... ) = 𝑃(𝐸) + 𝑃(𝐹) + 𝑃(𝐺) + ⋯
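The chip example can be used to check the addition rule for disjoint events. This sketch assumes each of the ten chips is equally likely to be drawn, which the notes imply by "randomly selects" but do not state; the helper `prob` is hypothetical. Exact fractions avoid floating-point rounding.

```python
from fractions import Fraction

# Chips are labeled 0 through 9; assumed equally likely.
sample_space = set(range(10))
E = {n for n in sample_space if n <= 2}   # number less than or equal to 2
F = {n for n in sample_space if n >= 8}   # number greater than or equal to 8

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

print(prob(E), prob(F))
print(prob(E | F))   # P(E or F): union of the two events
print(prob(E & F))   # P(E and F): intersection, empty for disjoint events
print(prob(E | F) == prob(E) + prob(F))
```

The final comparison holds only because E and F share no outcomes.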
*Ex. The probability model below shows the distribution of the number of Ravagers in each faction.
Number of Ravagers
1 2 3 4 5 6 7 8 9 or more
Probability
0 0 0 0 0 0 0 0 0.
(2) A university conducted a survey of 375 undergraduate students regarding their favorite team.
                        Freshman   Sophomore   Junior   Senior   Total
Team Iron Man           57         55          60       60       232
Team Captain America    24         15          10       13       62
Revengers               23         18          20       20       81
Total                   104        88          90       93       375
a. If a survey participant is selected at random, what is the probability that he or she is team Iron Man?
b. What is the probability that he or she is team Iron Man and a junior?
c. What is the probability that he or she is team Iron Man or a junior?
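The three survey questions can be worked from the table counts as sketched below. Team Iron Man and junior are not disjoint, so part (c) uses the general addition rule, P(E or F) = P(E) + P(F) - P(E and F). That rule is standard but is not written out in these notes, so treat its use here as an assumption; the variable names are illustrative.

```python
from fractions import Fraction

# Survey counts by team (rows) and class year (columns), from the table.
counts = {
    "Team Iron Man": {"Freshman": 57, "Sophomore": 55, "Junior": 60, "Senior": 60},
    "Team Captain America": {"Freshman": 24, "Sophomore": 15, "Junior": 10, "Senior": 13},
    "Revengers": {"Freshman": 23, "Sophomore": 18, "Junior": 20, "Senior": 20},
}

total = sum(sum(row.values()) for row in counts.values())
iron_man_total = sum(counts["Team Iron Man"].values())
junior_total = sum(row["Junior"] for row in counts.values())

p_iron_man = Fraction(iron_man_total, total)              # part (a)
p_both = Fraction(counts["Team Iron Man"]["Junior"], total)  # part (b)
# Part (c): subtract the overlap so Iron Man juniors are not counted twice.
p_either = p_iron_man + Fraction(junior_total, total) - p_both

print(p_iron_man, p_both, p_either)
print(round(float(p_iron_man), 3), round(float(p_both), 3), round(float(p_either), 3))
```

Subtracting the overlap is what distinguishes this from the disjoint case, where the overlap is zero.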
C) Complement of an Event – Let 𝑆 denote the sample space of a probability experiment and let 𝐸 denote an event. The complement of 𝐸, denoted 𝐸^𝐶, is all the outcomes in the sample space 𝑆 that are not outcomes in event 𝐸.
𝑃(𝐸^𝐶) = 1 − 𝑃(𝐸)
*Ex.
(1) According to Agent Coulson, 31% of the S.H.I.E.L.D. agents own a pet. What is the probability that a randomly selected agent does not own a pet?
0.
(2) The data below represent the travel time to work at the Avengers compound.
Travel Time            Frequency
Less than 5 minutes    24
5 to 9 minutes         39
10 to 14 minutes       62
15 to 19 minutes       72
20 to 24 minutes       74
25 to 29 minutes       30
30 to 34 minutes       45
35 to 39 minutes       11
40 to 44 minutes       8
45 to 59 minutes       15
60 to 89 minutes       5
90 or more minutes     4
a. What is the probability a randomly selected worker has a travel time of 90 or more minutes? 0.
b. Compute the probability that a randomly selected worker will have a commute time less than 90 minutes.
0.
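Both complement examples can be checked with a short sketch. The frequencies are copied from the travel time table and the 31% figure from part (1); the variable names are illustrative. Exact fractions are used so that 1 - P(E) is not affected by floating-point rounding.

```python
from fractions import Fraction

# Part (1): 31% of agents own a pet, so the complement is "does not own a pet".
p_pet = Fraction(31, 100)
p_no_pet = 1 - p_pet

# Part (2): travel time frequencies in table order; the last class is "90 or more minutes".
frequencies = [24, 39, 62, 72, 74, 30, 45, 11, 8, 15, 5, 4]
total_workers = sum(frequencies)

p_90_or_more = Fraction(frequencies[-1], total_workers)
# "Less than 90 minutes" is the complement of "90 or more minutes".
p_less_than_90 = 1 - p_90_or_more

print(float(p_no_pet))
print(total_workers, round(float(p_90_or_more), 4), round(float(p_less_than_90), 4))
```

Using the complement avoids adding the eleven shorter travel time classes one by one.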