Stat 5870 section 2 Homework #12: Due 10 December, to canvas by 11:59 pm

Conceptual problems:
Chapter 12: 1, 2, 4, 6a

Computational problems: (turn in)

1 point for each part. Total of 20 points.

0) 1 pt. What software are you using?

1) Voting issues have been in the news, but issues in Florida were seriously big news in the 2000 presidential election. One of those issues was whether a confusing ballot layout in one specific county, Palm Beach, shifted votes from Al Gore to Patrick Buchanan. We will repeat one of the analyses done to explore this possibility. Problem 12:22 gives a full explanation of the background. The short version is that the goal is to predict Buchanan2000, the log-transformed vote for Buchanan, in Palm Beach County, so we can evaluate whether there were unexpectedly many votes there for Buchanan. The available data include: the county name, the number of votes for the 5 candidates on the 2000 presidential ballot (Buchanan, Gore, Bush, Nader, and Browne), and 7 other pieces of information about each county (e.g., the total vote, votes in the 1996 presidential election, votes for Buchanan in the 1996 primary, and number of registered voters). Evaluation of the data suggests that all variables should be log transformed prior to model fitting. This has been done for you in all data files. To fairly predict Buchanan2000 in Palm Beach County, the modeling needs to ignore the actual Buchanan2000 value for Palm Beach. There are two ways to do this. I provide the data both ways, because different software works best with different formats, and I have done all the reformatting for you.
The simplest is to set the Palm Beach value to a missing value. This retains all the other information about Palm Beach County, so you can make the predictions for parts e and g. This has been done for you in the lvotes.csv file. SAS or JMP users will want to use this file. Each row is data for one of the 67 counties in Florida. If you get predictions for all observations, you will get the prediction for Palm Beach. R users (unless you use something other than dredge() in the MuMIn package) need a data set without any missing values. The actual requirement is no missing values in the explanatory variables, but when dredge() checks, it doesn't distinguish between a missing explanatory value and a missing Buchanan2000 value. The lvotes66.csv file has the data for the 66 counties, not including Palm Beach. You will want to use this file to fit models. The lvotesPB.csv file has the covariate information for Palm Beach County, which you can use with the predict() function to get the prediction for Palm Beach.
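
For readers who want to see the "fit on 66 counties, predict the 67th" idea in isolation, here is a minimal sketch in Python with numpy. It does not use the homework files: the data are synthetic, the variable names and the held-out index are hypothetical, and it assumes the log transformation is the natural log.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the county data (NOT the homework data):
# 67 rows, 2 predictors already on the log scale, and a log-scale response.
n = 67
X = rng.normal(size=(n, 2))
log_y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=n)

held_out = 0  # hypothetical index of the row whose response we ignore


def design(x):
    """Prepend an intercept column to a predictor matrix."""
    return np.column_stack([np.ones(len(x)), x])


# Fit by least squares using only the other 66 rows.
train = np.arange(n) != held_out
beta, *_ = np.linalg.lstsq(design(X[train]), log_y[train], rcond=None)

# Predict the held-out row on the log scale, then back-transform.
# Assumption: natural log, so exp() returns the prediction to the count scale.
pred_log = (design(X[[held_out]]) @ beta)[0]
pred_count = np.exp(pred_log)
print(f"log-scale prediction: {pred_log:.3f}, count-scale prediction: {pred_count:.1f}")
```

The held-out row contributes its explanatory values to the prediction but never its response, which is the role lvotesPB.csv plays for R users.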

Consider models with all subsets of variables. That is, all models with one variable, all models with two variables, three variables, ..., through the model with all 11 variables. Do not consider models with cross-product or quadratic variables.
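
An all-subsets search can be sketched as follows. This is an illustration in Python with numpy on synthetic data, not a replacement for dredge(), SAS, or JMP. It assumes the usual Gaussian-regression forms of the criteria up to an additive constant, with k counting the regression coefficients (including the intercept) plus the error variance; software packages differ in how they count k and in which constants they drop, so absolute values may not match your output even when the ranking does.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (NOT the homework data): 66 rows, 4 candidate variables,
# of which only x1 and x3 actually affect the response.
n, names = 66, ["x1", "x2", "x3", "x4"]
X = rng.normal(size=(n, len(names)))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)


def criteria(cols):
    """Return (AICc, BIC) for the model with an intercept and the given columns."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))
    k = A.shape[1] + 1  # assumption: coefficients plus the error variance
    aic = n * np.log(sse / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bic = n * np.log(sse / n) + k * np.log(n)
    return aicc, bic


# Every non-empty subset: all 1-variable models, all 2-variable models, ...
results = []
for size in range(1, len(names) + 1):
    for cols in combinations(range(len(names)), size):
        aicc, bic = criteria(list(cols))
        results.append((tuple(names[c] for c in cols), aicc, bic))

best_aicc = min(results, key=lambda r: r[1])
best_bic = min(results, key=lambda r: r[2])
print("best AICc model:", best_aicc[0])
print("best BIC model: ", best_bic[0])

# Differences from the best value show how close each competing model is.
for model, aicc, bic in sorted(results, key=lambda r: r[1])[:5]:
    print(model, f"delta AICc = {aicc - best_aicc[1]:.2f}")
```

With 4 candidate variables there are 15 non-empty subsets; with 11 there are 2047, which is still a small search.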

a) Using AICc (or AIC if AICc not easily available) as the criterion, what is the best set of variables to predict Buchanan2000?
b) Using BIC as the criterion, what is the best set of variables to predict Buchanan2000?
Note: SAS and JMP users will want to request information about many models. With SAS, I suggest best=2000. Also, note that SBC is SAS's name for what we call BIC. (SAS's BIC is something else, not frequently used now.) With JMP, I suggest 10 models per number of variables in the model.
c) A colleague tells you that political considerations suggest that the 6 variable model with nader2000 browne2000 total2000 clinton96 perot96 and totalreg should be reasonable. Using AIC, do the data suggest this model is a reasonable alternative to the best model? Briefly explain why or why not.
d) Is the model in 1c a reasonable alternative to the best model when using BIC? If the answer differs from that in 1c, briefly explain why.
e) Using the "best AIC" model, what is the predicted number of votes for Buchanan in Palm Beach county?
Remember the Y variable is log(count) and the question asks about predicted count.
f) If you want to know whether the observed number of votes for Buchanan is unexpectedly large, should you:
f1) compare the observed number (3407) to your prediction in part e
f2) compare the observed number to the confidence interval for the prediction in part e
f3) or, compare the observed number to the prediction interval for the prediction in part e.
Your answer is your choice (f1, f2, or f3) and a brief explanation.
Note: This is a "think-about" question, no computing required.
g) How robust is the prediction in part e to the choice of model? Evaluate this by seeing whether the estimated number of votes changes substantially when you use the "best BIC" model instead of the "best AIC" model. Report both estimated numbers of votes and your sense of whether these are similar or not.
h) Calculate the error SS (= residual Sum of Squares for some software) for 3 models: the best-AICc (or AIC) model (1a), the best-BIC model (1b) and a new model using all variables to predict Buchanan2000. Which model has the smallest SSE? Report the 3 SSE values and your answer.
i) Calculate the PRESS statistic for the 3 models used in 1h. Which model has the smallest PRESS statistic? Report the 3 PRESS values and your answer.
j) You should have noticed that the PRESS value for a model is larger than the SSE value for that model. Is this to be expected? Briefly explain why or why not.
k) If you want to predict the Buchanan2000 vote, should you use the model with the smallest SSE or the model with the smallest PRESS? Briefly explain why or why not.
Note: The approach outlined here is commonly used to detect fraud in many different settings.
I used a conceptually similar approach to estimate the impact of an oil spill on sea turtle nesting.
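
For parts 1h and 1i, the following sketch shows how the two quantities are defined, using Python with numpy on synthetic data (not the homework data). SSE sums the squared ordinary residuals. PRESS sums the squared deleted residuals, where each observation is predicted from a model fit without it. The sketch computes PRESS two ways: with the standard leverage shortcut e_i / (1 - h_ii), and by brute-force refitting, as a check that the two agree.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (NOT the homework data): 40 rows, 3 explanatory variables.
n = 40
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -0.7, 0.0]) + rng.normal(scale=0.3, size=n)
A = np.column_stack([np.ones(n), X])  # design matrix with intercept

# Ordinary least squares fit and error sum of squares.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sse = float(resid @ resid)

# Leverages h_ii are the diagonal of the hat matrix; with a reduced QR
# factorization A = QR they equal the row sums of Q squared.
Q, _ = np.linalg.qr(A)
h = np.sum(Q**2, axis=1)

# PRESS via the shortcut: deleted residual = e_i / (1 - h_ii).
press = float(np.sum((resid / (1.0 - h)) ** 2))

# PRESS by brute force: drop row i, refit, predict row i.
deleted = []
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    deleted.append(y[i] - A[i] @ b)
press_loo = float(np.sum(np.square(deleted)))

print(f"SSE = {sse:.4f}, PRESS = {press:.4f}, PRESS (refit) = {press_loo:.4f}")
```

Most regression software reports PRESS or the deleted residuals directly, so the refitting loop is only there to make the definition concrete.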

2) Evaluating the wage difference between men and women. The data in wage.csv are a subset of 500 US adults in the Current Population Survey, a major Census Bureau survey that collects more detailed information on characteristics of the US resident population. We will use these data to evaluate whether there is a wage differential between men and women. In other words, are women paid the same as men with similar backgrounds? The problem (with this and many similar analyses) is figuring out what "similar backgrounds" means. This is an example of one of the major uses of model selection discussed in lecture.

We will focus on the relationship between Y = logwage and X = Sex (0 = Male or 1 = Female). logwage is log(WeeklyEarnings). The additional characteristics we'll consider are the other variables in the data set (Region, MetropolitanStatus, Age (continuous), MaritalStatus, EdCode (continuous), and JobClass). EdCode is essentially the number of years of education, although it starts at 4 (no more than 4th grade) and ends at 18 (PhD degree). I have created one indicator variable for each of the categorical characteristics, Region: 1 = South/0 = Other, MetropolitanStatus: 1 = Not/0 = Is, MaritalStatus: 1 = Not/0 = Is, JobClass: 1 = Federal/0 = other. I have created quadratic versions of Age and EdCode; these are Age2 and EdCode2. I have done initial data screening, extracted a subset of 500 individuals, and log transformed the response (logwage) for you. Use the data in wage.csv. This has the indicator and quadratic variables that I've created for you. If you want to see the original data with the categorical variables, it's in wageOrig.csv.

a) Fit a simple linear regression using only Sex to predict log(WeeklyEarnings). Report the coefficient for Sex and its standard error.
b) Include all 9 variables (Region through JobClass plus Age2, EdCode2 and Sex) in a multiple linear regression model to predict log(WeeklyEarnings). Report the coefficient for Sex and its standard error.
c) Use model selection on the 8 variables (Region through JobClass plus Age2 and EdCode2) BUT not including Sex to identify which characteristics should be used to predict log(WeeklyEarnings). Use AICc (or AIC if AICc not easily available). Which variables are included in this model?
d) Repeat part c using BIC (SBC in SAS). Which variables are included in this model?
e) Return to the AICc (or AIC) chosen model from 2c and add Sex to that model. Report the coefficient for Sex and its standard error.
f) You have 3 estimates of the difference (in log Weekly Earnings) between men and women: from the Sex only model (2a), the all variables model (2b) and the subset of variables model (2e). Do all 3 models give similar estimates of the difference? If not, which model is different?
g) Do all 3 models give similar standard errors of the estimate? If not, which model is different?
h) Report the 95% confidence interval for the coefficient in part 2e.
i) (Reminder about interpreting results on log transformed responses from a while ago): Use the results from the model in 2e (the 2c model with Sex added) to fill in the blanks in this sentence about the results:
Median salary for women is _____ % smaller (95% confidence interval: _____, _____) than that of men with similar characteristics.
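
As a reminder of the arithmetic behind a sentence like the one above, here is a small Python sketch. The coefficient and standard error are made-up illustrative numbers, not results from wage.csv, and the multiplier 1.96 is a large-sample approximation to the t quantile that your software will supply. It assumes the response is a natural log, so exp(coefficient) is the ratio of medians (women relative to men).

```python
import numpy as np

# Hypothetical values for illustration only (NOT from the homework data).
b = -0.20      # coefficient for Sex on the log(WeeklyEarnings) scale
se = 0.05      # its standard error
t_mult = 1.96  # assumption: large-sample approximation to the t quantile

# Confidence interval on the log scale.
ci_log = (b - t_mult * se, b + t_mult * se)

# Back-transform: exp(b) is the ratio of medians, so the
# percent by which the median is smaller is 100 * (1 - exp(b)).
pct_smaller = 100 * (1 - np.exp(b))

# The interval endpoints swap order because a more negative coefficient
# corresponds to a larger percent reduction.
pct_low = 100 * (1 - np.exp(ci_log[1]))
pct_high = 100 * (1 - np.exp(ci_log[0]))

print(f"{pct_smaller:.1f}% smaller (95% CI: {pct_low:.1f}, {pct_high:.1f})")
```

Note that the interval is computed on the log scale first and only then back-transformed; back-transforming the standard error itself does not give a valid interval.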