Relevant conceptual problems:
Chapter 2: 5, 7, 10
Chapter 4: 1, 9, 10
Problems to be handed in (17 pts total, 3 pts "for free"):
1) The data for this problem are based on recent ISU studies on ways to encourage physical activity in school children.
150 elementary school children (ages 9-11) from across Iowa participated in the study. The response variable is the
average minutes of moderate or vigorous physical activity (MVPA) per day. At the start of the study,
this was measured for 3 weeks in September and October. Over the next year, the children's schools then conducted various activities
intended to increase physical activity. All children received the same treatment. One year later, physical activity was
measured in the same children, again for 3 weeks in September and October. The data are provided in two formats:
PAwide.csv: one row per child. ID is the child's id number. Pre is the average minutes of MVPA at the start of the study. Post is the average MVPA
one year later.
PAlong.csv: two rows of data per child. ID is the child's id number. Period is Pre or Post, indicating when that observation
was taken. mvpa is the average minutes of MVPA in that period.
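The two files hold the same measurements in different layouts. The following is a minimal sketch of how the wide layout maps onto the long layout, assuming the column names described above (ID, Pre, Post in the wide file; ID, Period, mvpa in the long file). The two data rows are made up for illustration and are not taken from PAwide.csv; the helper name wide_to_long is hypothetical.

```python
import csv
import io

# Hypothetical wide-format rows; the values are made up for illustration only.
wide_text = """ID,Pre,Post
1,30.0,35.5
2,42.5,40.0
"""

def wide_to_long(wide_rows):
    """Turn one row per child (ID, Pre, Post) into two rows per child (ID, Period, mvpa)."""
    long_rows = []
    for row in wide_rows:
        for period in ("Pre", "Post"):
            long_rows.append(
                {"ID": row["ID"], "Period": period, "mvpa": float(row[period])}
            )
    return long_rows

# Read the wide layout, then reshape it to the long layout.
wide_rows = list(csv.DictReader(io.StringIO(wide_text)))
long_rows = wide_to_long(wide_rows)

for r in long_rows:
    print(r)
```

Each child contributes one Pre row and one Post row to the long layout, so the long file has twice as many rows as the wide file.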
a) 1 pt. Should these data be analyzed using a paired t-test or a two-sample t-test? Briefly explain your choice.
b) 1 pt. Estimate the average change in minutes of MVPA (as Post - Pre). Make sure to include units.
c) 1 pt. Calculate the standard error of the average change in minutes of MVPA. Make sure to include units.
d) 2 pt. Test the null hypothesis of no change between pre and post measurements. Report the p-value and an appropriate one-sentence conclusion from the test.
e) 1 pt. Report the 95% confidence interval for the average change in minutes of MVPA (as Post - Pre).
f) 1 pt. This study has a serious weakness. The change from Pre to Post minutes of MVPA may be a consequence of the
treatment, or it may have other causes (e.g., the children are one year older, the weather might have been colder
when the study started). Describe a simple change in the study
design that allows you to separate the treatment effect from other potential causes.
Hint: Think about some of the studies we have seen so far.
Note: We haven't talked about this - I want you to think about the design and how a modified design could alleviate potential issues with interpretation of the results.
2) The data for this problem come from a study of quality of life in women treated for metastatic breast cancer. Women were randomly assigned to a control group who received "usual care" and a treatment group who received weekly sessions of group therapy and self-hypnosis. Follow-up continued for 122 months (just longer than 10 years). The study was set up to look at quality of life, but after 10 years, it was noticed that women in the treatment group appeared to live longer than those in the control group. This analysis follows up on longevity. Most women died during the follow-up period, but 3 women were still alive.
The data are in cancer.csv. The four variables are:
Survival: survival time, in months
Group: randomly assigned treatment (Control or Therapy)
Censor: 0 for a woman who died, 1 for a woman who was alive at the 122 month follow-up
SurvivalC: survival time, in months, with those three living women changed to 123
This change was made so that the censored individuals (survival > 122 months) have a value larger than 122.
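The recoding described above can be sketched as follows. This is only an illustration of the rule stated in the variable list, not the code used to build cancer.csv: the function name survival_c is hypothetical, the three example rows are made up, and the value 122 shown for the censored woman's Survival is an assumption.

```python
CENSORED_VALUE = 123  # from the assignment: women still alive are recorded as 123

def survival_c(survival, censor):
    """Return the recoded survival time: 123 if censored (Censor == 1), else unchanged."""
    return CENSORED_VALUE if censor == 1 else survival

# Hypothetical (Survival, Censor) pairs; not taken from cancer.csv.
rows = [(14, 0), (58, 0), (122, 1)]
recoded = [survival_c(s, c) for s, c in rows]
print(recoded)  # [14, 58, 123]
```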
a) 1 pt. Draw a boxplot of the survival times in the two groups of women. Use a value of 123 for those women still
alive at the end of the 122 month follow-up period. Your answer is the plot.
b) 1 pt. Estimate and report the median survival time for each of the two groups of women.
c) 2 pt. Use an appropriate method to test the null hypothesis that the treatment has no effect
on the median survival time. Report your p-value and an appropriate one-sentence conclusion from this test.
d) 2 pt. Give two reasons why it is more
appropriate to report medians, not means, for these data.
For at least one of the reasons, remember that most medical studies make recommendations for individuals.
3) The data in darwin.csv are a small fraction of Charles Darwin's studies of cross- and self-fertilized plants. He reported on all his studies in an 1878 book. These data are from corn. Darwin hand-pollinated plants to obtain cross-fertilized seedlings and self-fertilized seedlings. One cross-fertilized seedling and one self-fertilized seedling were planted next to each other in pots and placed in a greenhouse. Darwin claims that the two plants in each pair were in practically similar environments. We previously looked at these data in the lab 4 self-assessment. Here, we consider a different analysis.
a) 1 pt. Draw a dotplot of the differences (as cross - self). Your answer is the plot.
b) 1 pt. What, if anything, in the dotplot suggests that a paired t-test is not appropriate? Briefly explain your answer.
c) 2 pts. Use an appropriate non-parametric method to test whether there is a difference between
self- and cross-fertilized seedlings. Report your p-value and an appropriate one-sentence conclusion.
General advice:
1) Be sure to check which way around your computer program (SAS, R, JMP)
is computing the difference.
2) For questions requesting numeric answers, copy the appropriate numbers from the computer output.
3) You do NOT need to hand in copies of computer output. You should keep
a copy of your code so you can compare with our code, in case you get different
results.
4) If you use Rmarkdown, please do not include the Rmarkdown output (i.e., code + unfiltered results). See advice point 2).
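To illustrate advice point 1), here is a minimal sketch, written in Python rather than SAS, R, or JMP, of how the direction of subtraction flips the sign of the estimated change. The Pre and Post values are made up for illustration and are not taken from any of the assignment's data files.

```python
from statistics import mean

# Hypothetical paired measurements; values are made up for illustration only.
pre = [30.0, 42.5, 28.0]
post = [35.5, 40.0, 31.0]

# The same pairs, subtracted in the two possible directions.
post_minus_pre = [b - a for a, b in zip(pre, post)]
pre_minus_post = [a - b for a, b in zip(pre, post)]

print(mean(post_minus_pre))  # 2.0
print(mean(pre_minus_post))  # -2.0
```

The two estimates have the same magnitude but opposite signs, so a reported change only makes sense once you know which direction the software used.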