Stat 5870, section 2, Fall 2024. HW 9

Homework #9: Due 5 Nov, submitted to Canvas by 11:59 pm

Conceptual problems:
Chapter 7: 3, 4, 6, 7

Computational problems: (turn in)

1) 6 pts. The data in income.xlsx are a random sample of 258 US youth in 1979. The National Longitudinal Survey of Youth has followed these individuals through their lives. The Educ variable is the number of years of education for each person, classified into 5 categories: <12, 12, 13-15, 16 and >16. For those unfamiliar with the US educational system, <12 is a high school dropout, 12 is a high school graduate, 13-15 are those who have an associate's degree or some college education, 16 is a college bachelor's degree and >16 is at least some graduate work. The Educ.order variable assigns the integers 1, 2, 3, 4, 5 to the five Educ levels in order of increasing years of education. The Income2005 variable is each person's income in 2005. For this problem, the primary question is the value of graduate training, i.e. the mean difference in 2005 income between the >16 and the 16 groups. Secondary questions are whether or not all groups have the same average 2005 income and, if not, which groups differ from which other groups.

You have the data, the study description and the study goals. You decide how you want to analyze the data to answer these questions. You need to choose whether to report differences in means or multiplicative effects on medians. You should assess assumptions for the analyses you decide to use.
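If you choose to analyze log-transformed income, a difference in mean log income back-transforms to a multiplicative effect on medians. The following is a minimal Python sketch of that back-transformation, not a prescribed method. The function name and all numeric inputs are hypothetical placeholders, not results from income.xlsx.

```python
import math

def ratio_of_medians(diff_log_means, se_diff, t_crit):
    """Back-transform a difference in mean log(income) to a ratio of medians.

    diff_log_means: estimated difference in mean log income between two groups
    se_diff:        standard error of that difference (log scale)
    t_crit:         t quantile for the desired confidence level
    Returns (estimate, lower, upper) on the multiplicative scale.
    """
    estimate = math.exp(diff_log_means)
    lower = math.exp(diff_log_means - t_crit * se_diff)
    upper = math.exp(diff_log_means + t_crit * se_diff)
    return estimate, lower, upper

# Placeholder inputs for illustration only (NOT computed from income.xlsx).
# t_crit = 1.97 is an approximate 97.5% t quantile for about 250 df.
est, lo, hi = ratio_of_medians(diff_log_means=0.25, se_diff=0.10, t_crit=1.97)
print(f"median ratio: {est:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

The interval is computed on the log scale and then exponentiated, so it is not symmetric around the estimated ratio.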

Your answer will include:
The methods you used and, where you had to make choices, a justification for those choices.
A plot of the data (or summaries of the data) that shows the relationship between Educ (or Educ.order) and 2005 Income.
Assessments of the assumptions.
Answers to the study questions, with measures of uncertainty or strength of evidence (p-values) as appropriate.

Examples of writing up results are found in Summary of Statistical Findings paragraphs for each Case Study in the book.

2) 5 pts. In the year 1700, there were 6 known planets. The data in planet.csv tell you the distance from the sun to each planet. Although planetary orbits are elliptical, not circular, there is an agreed-upon standard distance for each planet. The order variable numbers the planets from 1 (closest to the sun) to 7 (furthest from the sun). For reasons that will become obvious, 5 is skipped.

For all parts of this problem, remember that log means natural log (ln or log base e).
a) Fit a regression of Y = log-transformed distance on X = i, the order from the sun. Report the estimated intercept and slope, both with their standard errors.
b) Bode’s law (late 18th century) in its simplest form claims that the distance from a planet to the sun is twice the distance from the previous planet to the sun. This law can be restated as log(distance) = constant + (log 2) * (order). Use a t-test to test whether the slope for your regression equals log(2). Report your t statistic.
c) The hypothesis in part b can be evaluated using a model comparison between a full and a reduced model.
full: log(distance) = intercept + (slope) * (order)
reduced: log(distance) = intercept + (log 2) * (order)
The SSE for the reduced model is 0.824. Report the ANOVA table for this model comparison and the F statistic.
Note: The reduced model has only 1 parameter that needs to be estimated. You need to get the SSE for the full model from the computer, then hand-calculate the ANOVA table and F statistic.
d) Predict the distances at which a planet with order = 5 and a planet with order = 8 would be found.
Note: In the late 1700s, Bode predicted that planets would be found near where expected for orders 5 and 8. They were. Ceres was the first asteroid to be discovered (order 5). Uranus is order 8. Pluto doesn't follow Bode's law.
e) Report an appropriate 95% interval that describes the uncertainty in the predicted distance for the planet with order = 8.
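The hand calculation described in part c can be sketched in Python as follows. This is an illustration under stated assumptions, not a worked answer. The reduced-model SSE (0.824) and n = 6 planets come from the problem statement. The full-model SSE below is a hypothetical placeholder that you must replace with the value from your own fit, and the function name is hypothetical as well.

```python
def model_comparison_table(sse_reduced, df_reduced, sse_full, df_full):
    """Extra-sum-of-squares ANOVA comparing a reduced model to a full model.

    Each df is the error df of that model: n minus its number of
    estimated parameters.
    """
    ss_diff = sse_reduced - sse_full      # reduction in SSE from the full model
    df_diff = df_reduced - df_full        # number of extra parameters in full
    ms_diff = ss_diff / df_diff
    ms_full = sse_full / df_full          # estimate of error variance
    return {
        "ss_diff": ss_diff,
        "df_diff": df_diff,
        "ms_diff": ms_diff,
        "ms_full": ms_full,
        "f": ms_diff / ms_full,
    }

n = 6                       # six planets known in 1700 (from the problem)
sse_full_placeholder = 0.5  # HYPOTHETICAL value, not from planet.csv

table = model_comparison_table(
    sse_reduced=0.824, df_reduced=n - 1,   # reduced: 1 estimated parameter
    sse_full=sse_full_placeholder, df_full=n - 2,  # full: intercept and slope
)
for name, value in table.items():
    print(name, round(value, 4))
```

The F statistic is compared to an F distribution with df_diff numerator and df_full denominator degrees of freedom.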

3) 9 pts. Ecological theory predicts that the number of species in a habitat patch depends on the size of that patch. This theory has been tested in tropical forests scheduled to be clearcut (all trees removed). These data come from an experimental study where forest plots were selected randomly then randomly assigned to a patch size. The entire area was clearcut except for the designated non-clearcut patches of different sizes. The data in diversity.txt are the number of species of butterflies some years after the area was clearcut. Area is the area of the patch in hectares (ha); species is the number of butterfly species. Ecological theory suggests that the appropriate regression model is diversity = b0 + b1 * log(area).

a) Fit the regression model. Report the estimated slope and the p-value for the test of slope = 0.
b) Write a one sentence interpretation of the effect of area on diversity that includes a number quantifying that effect.
c) Use an ANOVA lack of fit test to evaluate whether the regression model is appropriate. Report the p-value and write a one-sentence conclusion.
d) The investigators are interested in using this model to predict the number of species on new patches of forest. What is the area of a patch that has the smallest standard error for predictions of the mean number of species?
e) What is the area of a patch that has the smallest standard error for predictions of number of species in an individual patch?
f) Predict the number of species that will be found in a 50 ha patch.
g) There is uncertainty in the prediction from part 3f. Report an appropriate 95% interval that describes the uncertainty in the mean number of butterfly species in a 50 ha patch.
h) Report an appropriate 95% interval that describes the uncertainty in the number of butterfly species that would be found in a single 50 ha patch.
i) How large a patch would be needed to preserve 100 species?
Note: remember that the regression model has X = log(area) but the answer to this question is an area.
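The lack of fit test in part c compares the straight-line regression on log(area) to a model with a separate mean for each distinct patch area. A minimal Python sketch of that calculation follows, using only the standard library. The data below are made up purely for illustration and are not the values in diversity.txt; the function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def lack_of_fit(area, species):
    """Lack-of-fit F statistic for species = b0 + b1 * log(area)."""
    x = [math.log(a) for a in area]
    b0, b1 = fit_line(x, species)
    sse_reg = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, species))

    # Pure error: variation among replicate patches of the same area.
    groups = defaultdict(list)
    for a, yi in zip(area, species):
        groups[a].append(yi)
    sse_pure = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        sse_pure += sum((yi - mean) ** 2 for yi in ys)

    df_lof = len(groups) - 2            # distinct areas minus 2 line parameters
    df_pure = len(area) - len(groups)   # observations minus distinct areas
    f = ((sse_reg - sse_pure) / df_lof) / (sse_pure / df_pure)
    return f, df_lof, df_pure

# MADE-UP data for illustration only (not from diversity.txt).
area = [1, 1, 10, 10, 100, 100]
species = [9, 11, 24, 26, 29, 31]
f_stat, df_lof, df_pure = lack_of_fit(area, species)
print(f_stat, df_lof, df_pure)
```

The p-value is the upper tail area of an F distribution with (df_lof, df_pure) degrees of freedom; if SciPy is available, scipy.stats.f.sf(f_stat, df_lof, df_pure) computes it. The test requires replicate observations at some areas, otherwise df_pure is 0.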