Solutions to midterm 2 2008 I. > mynorms <- rnorm( 8, 2, 1) > mynorms [1] 1.863169 1.852449 1.128796 1.747202 2.199574 1.641059 3.086841 2.409171 > myexps <- rexp(10, 0.5) # parameter = 1/mean > myexps [1] 0.00684286 0.85222979 3.54068115 6.80401577 0.43420004 2.52013580 [7] 0.65344645 1.58676830 5.16518914 2.16818723 > wilcoxout <- wilcox.test( mynorms, myexps) > wilcoxout$p.value [1] 0.828557 # Set up for simulation study > S <- 1000 > nnorm <- 8 > nexp <- 10 # matrix with 1000 rows. each row is a sample of size 8 > mynormmat <- matrix( rnorm( S * nnorm, mean = 2), nrow = S) # matrix with 1000 rows. each row is sample of size 10 > myexpmat <- matrix( rexp( S * nexp, 0.5), nrow = S) # initialize vector to hold calculated p values > pvals <- rep(0, S) # do the 1000 Wilcoxon tests > for( i in 1:S) + pvals[i] <- wilcox.test( mynormmat[i,], myexpmat[i,] )$p.value # calculate the proportion of rejections > length( pvals[ pvals < 0.05] ) / S [1] 0.085 The test is anticonservative because the estimated actual size (0.085) is greater than the nominal size (0.05). The standard error of the estimated size is > sqrt( 0.085 * (1-0.085) / 1000) [1] 0.008819014 Since the estimated actual size minus 2 standard errors (0.085 - 2 * 0.0088 = 0.0674), the difference is probably real (not due to random MC sampling variability) II. There are two entities described, states and representatives. We need a table for each entity, and a common field to link them. States table state name state abbreviation Primary Key capital city population at time of last census Representatives table representative ID Primary Key state abbreviation Foreign key representative name representative political party year representative was elected representative's district The state abbreviation is the best choice of primary key for the state table as it is short and guaranteed to be unique. State name would also be ok. We need an artificial primary key for the representatives table, as there could always be representatives with the same name. III. > BaP <- read.table("/group/ftp/pub/kcowles/datasets/BaP.dat") > BaP V1 V2 1 10 24 2 10 35 3 25 41 4 40 65 5 40 27 6 45 56 7 45 67 8 55 50 9 55 25 10 70 20 11 75 78 12 90 25 13 220 38 14 285 40 > attach(BaP) > cor.test(BaP$V1, BaP$V2, method="spearman") Spearman's rank correlation rho data: BaP$V1 and BaP$V2 S = 439.4143, p-value = 0.9075 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.03425433 Warning message: In cor.test.default(BaP$V1, BaP$V2, method = "spearman") : Cannot compute exact p-values with ties So the point estimate is 0.0343. (The message about inability to calculate exact p-values isn't a problem, since we don't need the p-value!) III. Function jackspear <- function() { BaP <- read.table("/group/ftp/pub/kcowles/datasets/BaP.dat") n <- nrow(BaP) thetahat <- rep(0, n) thetawig <- rep(0, n) spearcorn <- cor.test(BaP$V1, BaP$V2, method="spearman")$estimate # for(i in 1:n) { thetahat[i] <- cor.test(BaP$V1[-i], BaP$V2[-i], method="spearman")$estimate thetawig[i] <- n * spearcorn - (n - 1) * thetahat[i] } thetawig.dot <- mean(thetawig) var.spearcor <- sum((thetawig - thetawig.dot)^2)/(n * (n - 1)) list(spearcorn = spearcorn, thetahat = thetahat, thetawig = thetawig, thetawig.dot = thetawig.dot, var.spearcor = var.spearcor) } Output > jackspear() $spearcorn rho 0.03425433 $thetahat [1] -0.14779062 -0.04281784 0.06224120 0.08425447 -0.01795587 0.07044226 [7] 0.06215493 0.05801127 0.08275933 0.13693064 -0.07883885 0.14364860 [13] 0.03457844 0.03457844 $thetawig [1] 2.40083869 1.03619260 -0.32957495 -0.61574740 0.71298695 -0.43618870 [7] -0.32845349 -0.27458588 -0.59631063 -1.30053767 1.50446573 -1.38787117 [13] 0.03004087 0.03004087 $thetawig.dot [1] 0.03180685 $var.spearcor [1] 0.07929454 There were 15 warnings (use warnings() to see them) So the bias-corrected estimate is 0.0318 and the standard error is sqrt(0.07929) = 0.2816.