Description
STA442 Methods of Applied Statistics
1 Smoking (20 marks)
This task concerns the 2014 American National Youth Tobacco Survey. On the pbrown.ca/teaching/appliedstats/data page there is an R version of the 2014 dataset smoke.RData, a pdf documentation file 2014-Codebook.pdf, and the code used to create the R version of the data smokingData.R.
The hypotheses to be investigated are:
1. Geographic variation (between states) in the mean age children first try cigarettes is substantially greater than variation amongst schools. As a result, tobacco control programs should target the states with the earliest smoking ages and not concern themselves with finding particular schools where smoking is a problem.
The collaborating scientists have provided the following prior information
• The variability in the rate of smoking initiation between states substantial, with some states having double or triple the rate of smoking update compared other states for comparable individuals. If 𝑈𝑖 is a random effect for state 𝑖, we might see exp(𝑈𝑖) = 2 or 3 but unlikely to see at 10.
• A flat hazard function is expected, so the prior on the Weibull shape parameter should allow for a 1 but it is not believed that shape parameter is 4 or 5.
Write a short consulting report addressing these hypotheses. Some additional notes:
• Show graphs of prior and posterior densities of model parameters related to the research questions.
• Interpret your model parameters in the context of the smoking problem, transforming model parameters to a more ‘natural’ scale as necessary.
• It is important to state precisely what your prior distributions are (i.e. a Gamma(0.4, 3.1) distribution for the log of the intercept parameter), but also show how these distributions are consistent with the prior assumptions by showing quantiles or means or tail probabilities.
• You’re given three confounders (sex, rural/urban, ethnicity), it’s up to you if you’d like to include interactions.
• You might want to fit more than one model, either as exploratory work or sensitivity assessments, but you should use a single ‘best’ model to answer the research questions. Fitting two models and selecting one of them with a fairly ad hoc explanation is fine, comparing 10 models without some sort of formal assessment (a topic we haven’t covered) wouldn’t be.
smokeFile = Pmisc::downloadIfOld(“http://pbrown.ca/teaching/appliedstats/data/smoke.RData”) Loading required namespace: R.utils load(smokeFile) smoke = smoke[smoke$Age > 9, ]
forInla = smoke[, c(“Age”, “Age_first_tried_cigt_smkg”,
param = c(0.5, 0.05)))) + f(state, model = “iid”,
hyper = list(prec = list(prior = “pc.prec”, param = c(0.5,
0.05)))), control.family = list(variant = 1, hyper = list(alpha = list(prior = “normal”, param = c(log(4), (2/3)^(-2))))), control.mode = list(theta = c(8,
2, 5), restart = TRUE), data = forInla, family = “weibullsurv”, verbose = TRUE)
rbind(fitS2$summary.fixed[, c(“mean”, “0.025quant”, “0.975quant”)], Pmisc::priorPostSd(fitS2)$summary[, c(“mean”, “0.025quant”, “0.975quant”)])
mean 0.025quant 0.975quant
(Intercept) -0.618123774 -0.673217580 -0.562381805
RuralUrbanRural 0.114219840 0.054982509 0.173126077
SexF -0.050079551 -0.078482659 -0.021834950
Raceblack -0.048030138 -0.090878915 -0.005850117
Racehispanic 0.025707830 -0.008877249 0.060088014
Raceasian -0.194755194 -0.286897781 -0.108234511
Racenative 0.110090096 0.004857814 0.207960713
Racepacific 0.175543344 0.008688396 0.324061510
SexF:Raceblack -0.016889846 -0.073916811 0.039994596
SexF:Racehispanic 0.016228302 -0.029712823 0.062147624
SexF:Raceasian 0.005526784 -0.121718791 0.131914982
SexF:Racenative -0.043646334 -0.200293351 0.109720191
SD for state 0.056808074 0.024521813 0.101231683
2 Death on the roads (20 marks)
The dataset below is a subset of the data from www.gov.uk/government/statistical-data-sets/ras30-reportedcasualties-in-road-accidents, with all of the road traffic accidents in the UK from 1979 to 2015. The data below consist of all pedestrians involved in motor vehicle accidents with either fatal or slight injuries (pedestrians with moderate injuries have been removed).
dim(pedestrians)
[1] 1159453 6
pedestrians[1:3, ]
time age sex Casualty_Severity Light_Conditions
54 1979-01-01 22:40:00 26 – 35 Male Slight Darkness – lights lit
65 1979-01-02 10:40:00 26 – 35 Male Slight Daylight
79 1979-01-02 14:25:00 46 – 55 Male Weather_Conditions 54 Snowing no high winds
65 Raining no high winds
79 Raining no high winds Slight Daylight
table(pedestrians$Casualty_Severity, pedestrians$sex)
Male Female Slight 637977 481832
Fatal 24432 15212
range(pedestrians$time)
[1] “1979-01-01 01:00:00 EST” “2015-12-31 23:35:00 EST”
Write a short report assessing whether the UK road accident data are consistent with the hypothesis that women tend to be, on average, safer as pedestrians than men, particularly as teenagers and in early adulthood. Treat fatal accidents as cases and slight injuries as controls, and use a conditional logistic regression to adjust for time of day, lighting conditions, and weather. Make your report self-contained so it can be read by someone who has not seen this homework sheet. Some (but not all) of the code below could be helpful. Explain clearly how you have stratified the data (you could use a different stratification than I did if you wish).
Fit a glm
summary(glm(y ~ sex + age + Light_Conditions + Weather_Conditions,
data = x, family = “binomial”))$coef[1:4, ]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.2507678 0.02370768 -137.118754 0.000000e+00 sexFemale -0.2988120 0.01245826 -23.985048 3.983187e-127 age0 – 5 0.1124659 0.03448000 3.261772 1.107180e-03 age6 – 10 -0.4369786 0.03241323 -13.481489 2.010085e-41 fit a conditional logistic model
library(“survival”)
theClogit = clogit(y ~ age + age:sex + strata(strata),
data = x)
Some results download data
(a) males (b) females
Figure 1: some results without adequate explanation
pedestrainFile = Pmisc::downloadIfOld(“http://pbrown.ca/teaching/appliedstats/data/pedestrians.rds”) pedestrians = readRDS(pedestrainFile)
pedestrians = pedestrians[!is.na(pedestrians$time), ] pedestrians$y = pedestrians$Casualty_Severity == “Fatal” pedestrians$timeCat = format(pedestrians$time, “%Y_%b_%a_h%H”) pedestrians$strata = paste(pedestrians$Light_Conditions,
pedestrians$Weather_Conditions, pedestrians$timeCat)
remove strata with no cases or no controls
theTable = table(pedestrians$strata, pedestrians$y) onlyOne = rownames(theTable)[which(theTable[, 1] ==
0 | theTable[, 2] == 0)] x = pedestrians[!pedestrians$strata %in% onlyOne, ]
theCoef = rbind(as.data.frame(summary(theClogit)$coef),
`age 26 – 35` = c(0, 1, 0, NA, NA))
theCoef$sex = c(“Male”, “Female”)[1 + grepl(“Female”,
rownames(theCoef))]
theCoef$age = as.numeric(gsub(“age|Over| – [[:digit:]].*|[:].*”,
“”, rownames(theCoef))) theCoef = theCoef[order(theCoef$sex, theCoef$age),
] matplot(theCoef[theCoef$sex == “Male”, “age”], exp(as.matrix(theCoef[theCoef$sex ==
“Male”, c(“coef”, “se(coef)”)]) %*% Pmisc::ciMat(0.99)), log = “y”, type = “l”, col = “black”, lty = c(1,
2, 2), xaxs = “i”, yaxs = “i”)
matplot(theCoef[theCoef$sex == “Female”, “age”], exp(as.matrix(theCoef[theCoef$sex ==
“Female”, c(“coef”, “se(coef)”)]) %*% Pmisc::ciMat(0.99)), log = “y”, type = “l”, col = “black”, lty = c(1,
2, 2), xaxs = “i”)
Reviews
There are no reviews yet.