Stat 5870, section 2 - Homework #11: Due 3 Dec, 11:59 pm
Conceptual problems:
Chapter 9: 1, 2, 3, 4, 8, 11
Chapter 10: 2, 3, 4, 5, 7, 8
Chapter 11: 7, 8
Computational problems: (turn in)
1) Air pollution and mortality: Problem 11:23 with my questions. Problem 11:23 in the book describes
the context for the data in ex1123.csv.
Briefly, these data come from 60 US cities and their suburbs during 1959-1961.
This is before passage of the Clean Air Act, so air pollution levels in some areas were
very high. (As an aside, we are just a couple of weeks past the 76th anniversary of the 1948 Donora (Pennsylvania)
smog event, which killed 20 during the event and ca. 50 immediately following.)
The focus of this question is the relationship between SO2, one particular air pollutant, and
overall mortality. The conceptual issue is that these are observational data and cities vary
in many attributes besides SO2. Some of these attributes may be linked to mortality; others may be linked to SO2.
The data set includes
Mort: overall annual mortality, as deaths per 100,000 people
Precip: mean annual precipitation, in inches
Educ: median # years of school completed by folks older than 25
NonWhite: percent of 1960 population who are non-white
SO2: an annual measure of SO2 in the area. The book describes the units of this number; we will simply call the units "SO2 unit".
In all parts of this problem, the Y variable will be Mort; the units for Y can be simplified to "deaths".
We will not worry about transforming variables. In all cases, association between X and Y will be quantified
by a regression slope. We will ignore the NOX variable (it's another pollutant that we won't worry about).
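For reference, the slope and its standard error in a simple linear regression can be computed directly from sums of squares. The sketch below is illustrative only: it is written in plain Python (an assumption, since this assignment can be done in R, SAS, or JMP), the function name slr_slope_and_se is hypothetical, and the five data points are made up, not taken from ex1123.csv.

```python
import math


def slr_slope_and_se(x, y):
    """Least-squares slope of y on x and the slope's standard error.

    Hypothetical helper for illustration; any regression routine in
    R, SAS, or JMP reports the same two quantities.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Residual variance uses n - 2 degrees of freedom (intercept + slope).
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = sse / (n - 2)
    se_slope = math.sqrt(sigma2_hat / sxx)
    return slope, se_slope


# Made-up data, NOT the air pollution data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, se = slr_slope_and_se(x, y)
print(f"slope = {slope:.4f}, se = {se:.4f}")
```

With the real data, x would be the SO2 column and y the Mort column, so the slope is in deaths per SO2 unit.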
a) Draw a scatterplot matrix showing the pairwise relationships between Mort, Educ, NonWhite, Precip and SO2. Your
answer is the plot.
b) Fit a simple linear regression using just SO2. Report the slope, including units, and its standard error.
c) Fit a multiple linear regression to estimate the association of SO2 and Mort controlling for
differences in Precip, Educ, and NonWhite. Report the slope for SO2, including units, and its standard error.
d) Explain in your own words why the estimated slopes in parts b and c are different.
e) Which slope, the SLR one in part b or the MLR one in part c, is more likely to be closer to the causal relationship between SO2 and mortality?
Hint: Think back to problem 2 on HW 10, which explored how each slope relates to comparisons between pairs of observations.
Subsequent parts are based on the MLR in part c.
f) Plot the raw residuals against the standardized residuals. Why is the range of the
standardized residuals so much smaller than that of the raw residuals? Your answer is the
plot and your explanation.
g) Are there any concerns shown in the standardized residual vs predicted value plot? Include the plot
and a brief explanation with your answer.
h) Are there any influential points? Briefly explain your answer and include any supporting plots with your answer.
i) Is there any concern about multicollinearity? Briefly explain your answer.
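The residual and influence diagnostics in parts f through h all come from the same fitted model. The following is a minimal sketch, assuming plain Python and made-up data (not ex1123.csv). The helpers invert and ols_diagnostics are hypothetical names, and "standardized residual" here means the raw residual divided by its estimated standard error, s * sqrt(1 - h_ii); your software may use a slightly different definition, so check its documentation.

```python
import math


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_diagnostics(X, y):
    """Raw residuals, leverages, standardized residuals, and Cook's distances.

    X is a list of rows; each row starts with 1.0 for the intercept.
    """
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xtx_inv = invert(xtx)
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = [sum(xtx_inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    raw = [y[i] - sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    s2 = sum(e * e for e in raw) / (n - p)  # residual mean square
    # Leverage h_ii is the i-th diagonal element of X (X'X)^-1 X'.
    lev = [sum(X[i][a] * xtx_inv[a][b] * X[i][b]
               for a in range(p) for b in range(p)) for i in range(n)]
    std = [raw[i] / math.sqrt(s2 * (1.0 - lev[i])) for i in range(n)]
    cooks = [std[i] ** 2 * lev[i] / (p * (1.0 - lev[i])) for i in range(n)]
    return {"raw": raw, "leverage": lev, "standardized": std, "cooks": cooks}


# Made-up data: two predictors, with the last row far from the others in x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 20]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
y = [5.0, 7.2, 8.9, 11.1, 12.8, 15.3, 17.0, 25.0]
X = [[1.0, float(a), float(b)] for a, b in zip(x1, x2)]

diag = ols_diagnostics(X, y)
for i in range(len(y)):
    print(i, round(diag["raw"][i], 3), round(diag["standardized"][i], 3),
          round(diag["leverage"][i], 3), round(diag["cooks"][i], 3))
```

The raw residuals are in the units of Y, while the standardized residuals are unitless, which is worth keeping in mind when comparing their ranges.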
2) Problem 10:28 (El Nino and Hurricanes), with my questions. This study examines
the association between hurricane activity in a particular year and two climatic variables:
the El Nino status (cold, neutral, or warm), and rainfall in West Africa (wet or dry). The data are in
ex1028.csv. The variables we will consider are:
Year: 1950 through 1997
ElNino: categorical variable: cold, neutral or warm
Temperature: same info as El Nino, but expressed as -1 (cold), 0 (neutral), and 1 (warm)
WestAfrica: indicator variable for wet or dry West African condition. This is 1 during wet conditions and
0 during dry conditions.
StormIndex: an aggregate measure of hurricane activity
We will only consider one response: StormIndex. We will ignore Storms and Hurricanes (counts of
tropical storms and hurricanes).
Our goal is to describe the association between storm index and the two climatic variables: El Nino
temperature and West Africa rainfall condition. StormIndex will be the Y variable for all models.
None of the variables in this problem have "real" units, so no need to include units in your answers.
One issue that influences our choice of analysis is the likely existence of a temporal trend (Year)
unrelated to El Nino or West Africa rainfall.
Note: Be aware of the X variable(s) to use for each part. Different parts use different models.
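To see what "controlling for" a trend does to a slope, here is a small sketch in plain Python (an assumption; use whatever software you normally use). It relies on the Frisch-Waugh-Lovell result: the MLR slope for x in the model y ~ x + z equals the slope from regressing the residuals of y on z against the residuals of x on z. All names (slr, residuals, slope_controlling_for) are hypothetical, and the data are made up so that the true coefficient of x is 3; they are not the values in ex1028.csv.

```python
def mean(values):
    return sum(values) / len(values)


def slr(x, y):
    """Least-squares (intercept, slope) for y regressed on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def residuals(x, y):
    """Residuals from the simple linear regression of y on x."""
    intercept, slope = slr(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def slope_controlling_for(x, y, z):
    """Slope of x in the model y ~ x + z, via residual-on-residual regression."""
    x_given_z = residuals(z, x)
    y_given_z = residuals(z, y)
    return slr(x_given_z, y_given_z)[1]


# Made-up data: x tends to be larger late in the series, and y also
# rises with the trend, so x and the trend are confounded.
trend = [0, 1, 2, 3, 4, 5, 6, 7]
x = [0, 0, 1, 0, 1, 1, 0, 1]
y = [2 + 3 * xi + 0.5 * ti for xi, ti in zip(x, trend)]

naive_slope = slr(x, y)[1]                           # ignores the trend
adjusted_slope = slope_controlling_for(x, y, trend)  # controls for the trend
print(naive_slope, adjusted_slope)
```

In this constructed example the unadjusted slope absorbs part of the trend, while the adjusted slope recovers the coefficient used to build y.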
a) Fit an MLR to describe the association between Storm index and temperature, controlling
for a linear temporal trend (Year). Use Temperature as a regression variable. Report the
regression coefficient (slope) for Temperature, its se, and the p-value for the test of slope = 0.
b) Fit an MLR to describe the association between Storm index and West Africa, controlling
for a linear temporal trend (Year) and Temperature. Use WestAfrica as a regression variable.
Report the regression coefficient (slope) for West Africa, its se, and its p-value.
c) Define WestAfrica as a factor/class/red bar variable and fit an MLR
to estimate the difference between wet years (WestAfrica = 1) and dry years (=0),
controlling for a linear temporal trend (Year) and Temperature.
Report the estimated difference (as wet - dry) and its p-value.
Note: SAS and JMP users may want to use LSMEANS to extract the information needed to be confident the
difference is calculated in the "right" direction. (R users might also do this as a sanity/reality check.)
d) Briefly explain why the estimates and p-values in parts b and c are identical.
Hint: Think about the interpretation of the slope in b and about how the values of the WestAfrica variable
relate to wet/dry conditions.
e) Questions b and c show that a factor variable with 2 levels and a regression on 0/1 values are the same model.
Is this also true for a variable with 3 levels, e.g., ElNino (cold / neutral / warm) compared to
Temperature (-1, 0, 1)? Briefly explain why or why not.
Note: This is a "think about" question, not an "analyze data" question.
f) You wonder whether a linear temporal trend is sufficient to describe
how storm intensity may have changed over time (Year), even for the same
levels of Temperature and WestAfrica. Fit an MLR with Year, Year^2, Temperature (as a regression variable),
and WestAfrica. Is a linear model for Year adequate? Briefly explain why or why not.
g) Use the coefficients from the quadratic model (question 2f) to estimate the year of the
minimum (or maximum) Storm Index. Report your estimate and whether this is a minimum or maximum.
h) Are there any concerns with multicollinearity in the model used in part 2f? Briefly explain
why or why not.
i) Calculate a "centered" Year variable as Year - 1974, refit the quadratic model using centered year
and centered year squared. Are there any concerns with multicollinearity in this model? Briefly explain
why or why not.
j) Your final question is whether the difference between wet and dry West African conditions
depends on the Temperature. Fit an MLR with Year, Year^2, Temperature, West Africa and a term that
allows you to answer your final question. Report the term you added, the p-value for this term, and a one-sentence
conclusion.
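Two calculations behind parts g through i can be sketched in a few lines. This is an illustration in plain Python (an assumption; R, SAS, and JMP all work equally well), with hypothetical function names. The coefficients passed to quadratic_turning_point are made up, not estimates from ex1028.csv, and the variance inflation factor below considers only Year and Year^2, ignoring Temperature and WestAfrica, so it is a simplification of what your software reports for the full model.

```python
import math


def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def vif_two_predictors(u, v):
    """VIF when the model has only two predictors: 1 / (1 - r^2)."""
    r = correlation(u, v)
    return 1.0 / (1.0 - r * r)


def quadratic_turning_point(b1, b2):
    """Location and type of the turning point of b0 + b1*t + b2*t^2."""
    if b2 == 0:
        raise ValueError("no turning point: the quadratic coefficient is zero")
    location = -b1 / (2.0 * b2)  # where the derivative b1 + 2*b2*t equals 0
    kind = "minimum" if b2 > 0 else "maximum"
    return location, kind


years = list(range(1950, 1998))          # 1950 through 1997
centered = [yr - 1974 for yr in years]   # centering used in part i

r_raw = correlation(years, [yr ** 2 for yr in years])
r_centered = correlation(centered, [c ** 2 for c in centered])
vif_raw = vif_two_predictors(years, [yr ** 2 for yr in years])
vif_centered = vif_two_predictors(centered, [c ** 2 for c in centered])
print(r_raw, r_centered, vif_raw, vif_centered)

# Hypothetical coefficients for the linear and quadratic terms.
loc, kind = quadratic_turning_point(b1=-2.0, b2=0.5)
print(loc, kind)
```

If the coefficients come from a model fit to centered year, the turning point is on the centered scale, so add 1974 to convert it back to a calendar year.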