Alex Wako
1. The dataset trees contains measurements of Girth (tree diameter) in inches, Height in feet, and Volume of timber (in cubic feet) of a sample of 31 felled black cherry trees. The following commands can be used to read the data into R.
# the data set "trees" is contained in the R package "datasets"
require(datasets)
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
(a) (1pt) Briefly describe the data set trees, i.e., how many observations (rows) and how many variables (columns) are there in the data set? What are the variable names?
## [1] 31 3
The trees data set has 31 rows and 3 variables. The names of the variables are Girth, Height, and Volume.
(b) (2pts) Use the pairs function to construct a scatter plot matrix of the logarithms of Girth, Height and Volume.
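No figure is reproduced here; a minimal sketch of the call for the logged variables (note that the Appendix code plots the unlogged data):
# scatter plot matrix of log(Girth), log(Height), log(Volume)
pairs(log(trees))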
(c) (2pts) Use the cor function to determine the correlation matrix for the three (logged) variables.
## Girth Height Volume
## Girth 1.0000 0.5193 0.9671
## Height 0.5193 1.0000 0.5982
## Volume 0.9671 0.5982 1.0000
(d) (2pts) Are there missing values?
## [1] 0
There are no missing values.
(e) (2pts) Use the lm function in R to fit the multiple regression model:
log(Volume_i) = β0 + β1 log(Girth_i) + β2 log(Height_i) + ε_i
and print out the summary of the model fit.
##
## Call:
## lm(formula = log(y) ~ log(x1) + log(x2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.16856 -0.04849 0.00243 0.06364 0.12922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.632 0.800 -8.29 5.1e-09 ***
## log(x1) 1.983 0.075 26.43 < 2e-16 ***
## log(x2) 1.117 0.204 5.46 7.8e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0814 on 28 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.976
## F-statistic: 613 on 2 and 28 DF, p-value: <2e-16
(f) (3pts) Create the design matrix (i.e., the matrix of predictor variables), X, for the model in (e), and verify that the least squares coefficient estimates in the summary output are given by the least squares formula: β̂ = (XᵀX)⁻¹Xᵀy.
## (Intercept) log(x1) log(x2)
## 1 1 2.116 4.248
## 2 1 2.152 4.174
## 3 1 2.175 4.143
## 4 1 2.351 4.277
## 5 1 2.370 4.394
## 6 1 2.380 4.419
## 7 1 2.398 4.190
## 8 1 2.398 4.317
## 9 1 2.407 4.382
## 10 1 2.416 4.317
## 11 1 2.425 4.369
## 12 1 2.434 4.331
## 13 1 2.434 4.331
## 14 1 2.460 4.234
## 15 1 2.485 4.317
## 16 1 2.557 4.304
## 17 1 2.557 4.443
## 18 1 2.588 4.454
## 19 1 2.617 4.263
## 20 1 2.625 4.159
## 21 1 2.639 4.357
## 22 1 2.653 4.382
## 23 1 2.674 4.304
## 24 1 2.773 4.277
## 25 1 2.791 4.344
## 26 1 2.851 4.394
## 27 1 2.862 4.407
## 28 1 2.885 4.382
## 29 1 2.890 4.382
## 30 1 2.890 4.382
## 31 1 3.025 4.466
## attr(,"assign")
## [1] 0 1 2
## [,1]
## [1,] -6.632
## [2,] 1.983
## [3,] 1.117
The least squares coefficients given in the summary output match the estimates computed directly from β̂ = (XᵀX)⁻¹Xᵀy.
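A minimal sketch of this check, assuming lm_tree is the fit from part (e) as in the Appendix; solve() is used here in place of the Appendix's ginv():
# design matrix and closed-form least squares estimates
X <- model.matrix(lm_tree)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% log(trees$Volume)
# compare with the coefficients reported by lm
cbind(beta_hat, coef(lm_tree))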
(g) (3pts) Compute the predicted response values from the fitted regression model, the residuals, and an estimate of the error variance Var(ε) = σ².
Predicted response values: 2.3103, 2.2979, 2.3085, 2.8079, 2.9769, 3.0226, 2.8029, 2.9457, 3.0358, 2.9815,
3.0571, 3.0313, 3.0313, 2.9749, 3.1182, 3.2466, 3.4015, 3.4751, 3.3197, 3.2182, 3.4677, 3.5241, 3.4785, 3.643, 3.7549, 3.9295, 3.966, 3.9832, 3.9942, 3.9942, 4.3554
The residuals: 0.0219, 0.0343, 0.0138, -0.0106, -0.043, -0.042, -0.0557, -0.0443, 0.0822, 0.0093, 0.1292, 0.0132,
0.032, 0.0838, -0.1686, -0.1465, 0.119, -0.1645, -0.0732, -0.0033, 0.0733, -0.0678, 0.1134, 0.0024, -0.003, 0.0851,
0.054, 0.0824, -0.0527, -0.0624, -0.0116
Estimate of error variance: 0.0066
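These quantities can also be extracted directly from the fitted model; a minimal sketch, assuming lm_tree from part (e):
# predicted responses (on the log scale), residuals, and error variance estimate
y_hat <- fitted(lm_tree)
res <- resid(lm_tree)
sigma2_hat <- sum(res^2) / df.residual(lm_tree) # SSR / (n - p) = SSR / 28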
2. Consider the simple linear regression model:
y_i = β0 + β1 x_i + ε_i
Part 1: β0 = 0
(a) (3pts) Assume β0 = 0. What is the interpretation of this assumption? What is the implication on the regression line? What does the regression line plot look like?
When β0 = 0, the model has no intercept. The errors are still unobservable random variables with mean 0 and variance σ², so the mean of y_i is β1 x_i, the variance of y_i is σ², and the covariance between distinct observations is 0. The regression line is therefore forced to pass through the origin: the plot is a straight line through (0, 0) with slope β1, and the model reduces to y_i = β1 x_i + ε_i.
(b) (4pts) Derive the LS estimate of β1 when β0 = 0.
With β0 = 0, the sum of squared residuals is
SSR(β1) = Σ (y_i − β1 x_i)²,
where the sum runs over i = 1, …, n. Setting the derivative with respect to β1 to zero:
∂SSR/∂β1 = −2 Σ x_i (y_i − β1 x_i) = 0
⇒ Σ x_i y_i − β1 Σ x_i² = 0
⇒ Σ x_i y_i = β1 Σ x_i²
⇒ β̂1 = Σ x_i y_i / Σ x_i²
(c) (3pts) How can we introduce this assumption within the lm function?
We can introduce the assumption within the lm function by adding 0 (or equivalently -1) to the model formula, which removes the intercept term, as in the sketch below.
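A minimal sketch, with x and y standing for illustrative predictor and response vectors:
# removing the intercept forces the fitted line through the origin
fit0 <- lm(y ~ 0 + x) # equivalently: lm(y ~ x - 1)
coef(fit0) # matches the closed form sum(x * y) / sum(x^2)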
Part 2: β1 = 0
(d) (3pts) For the same model, assume β1 = 0. What is the interpretation of this assumption? What is the implication on the regression line? What does the regression line plot look like?
When β1 = 0, the model has no slope. The errors are still unobservable random variables with mean 0 and variance σ², so the mean of y_i is β0, the variance of y_i is still σ², and the covariance between distinct observations is still 0. The regression line is a horizontal line at height β0, and the model reduces to y_i = β0 + ε_i.
(e) (4pts) Derive the LS estimate of β0 when β1 = 0.
With β1 = 0, the sum of squared residuals is
SSR(β0) = Σ (y_i − β0)²,
where the sum runs over i = 1, …, n. Setting the derivative with respect to β0 to zero:
∂SSR/∂β0 = −2 Σ (y_i − β0) = 0
⇒ Σ y_i − n β0 = 0
⇒ β̂0 = (1/n) Σ y_i = ȳ
(f) (3pts) How can we introduce this assumption within the lm function?
We can introduce the assumption within the lm function by regressing y on the intercept only, using the formula y ~ 1, as in the sketch below.
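A minimal sketch, again with y an illustrative response vector:
# intercept-only model
fit1 <- lm(y ~ 1)
coef(fit1) # equals mean(y), the estimate derived in (e)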
3. Consider the simple linear regression model:
y_i = β0 + β1 x_i + ε_i
(a) (10pts) Use the LS estimation general result β̂ = (XᵀX)⁻¹Xᵀy to find the explicit estimates for β0 and β1.
β̂ = (XᵀX)⁻¹Xᵀy
Let x̄ and ȳ denote the sample means, and define SSx = Σ(x_i − x̄)² and SSxy = Σ(x_i − x̄)(y_i − ȳ), with all sums over i = 1, …, n. The design matrix X has a column of ones and a column of the x_i, so
XᵀX = [ n     Σx_i
        Σx_i  Σx_i² ]
Its determinant is n Σx_i² − (Σx_i)² = n Σ(x_i − x̄)² = n·SSx, hence
(XᵀX)⁻¹ = (1 / (n·SSx)) [ Σx_i²  −Σx_i
                          −Σx_i   n    ]
and
Xᵀy = [ Σy_i
        Σx_i y_i ]
Multiplying out:
β̂ = (XᵀX)⁻¹Xᵀy = (1 / (n·SSx)) [ Σx_i² Σy_i − Σx_i Σx_i y_i
                                  n Σx_i y_i − Σx_i Σy_i   ]
For the slope, n Σx_i y_i − Σx_i Σy_i = n Σ(x_i − x̄)(y_i − ȳ) = n·SSxy, so
β̂1 = SSxy / SSx
For the intercept, using Σx_i² = SSx + n x̄² and Σx_i y_i = SSxy + n x̄ ȳ:
Σx_i² Σy_i − Σx_i Σx_i y_i = n ȳ (SSx + n x̄²) − n x̄ (SSxy + n x̄ ȳ) = n (ȳ·SSx − x̄·SSxy)
so
β̂0 = ȳ − x̄ (SSxy / SSx) = ȳ − β̂1 x̄
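As a quick numerical check of these formulas, a sketch that reuses the logged trees variables from problem 1 as an example simple regression:
# closed-form simple regression estimates
x <- log(trees$Girth)
y <- log(trees$Volume)
SSx <- sum((x - mean(x))^2)
SSxy <- sum((x - mean(x)) * (y - mean(y)))
b1 <- SSxy / SSx
b0 <- mean(y) - b1 * mean(x)
c(b0, b1) # should match coef(lm(y ~ x))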
(b) (5pts) Show that the LS estimates β̂0 and β̂1 are unbiased estimates for β0 and β1 respectively.
Bias[β̂1] = E[β̂1] − β1
E[β̂1] = E[ Σ(x_i − x̄)(Y_i − Ȳ) / Σ(x_i − x̄)² ]
= E[ Σ(x_i − x̄)Y_i − Ȳ Σ(x_i − x̄) ] / Σ(x_i − x̄)²
= Σ(x_i − x̄)E[Y_i] / Σ(x_i − x̄)²   (since Σ(x_i − x̄) = 0)
= Σ(x_i − x̄)(β0 + β1 x_i) / Σ(x_i − x̄)²
= [ β0 Σ(x_i − x̄) + β1 Σ(x_i − x̄)x_i ] / Σ(x_i − x̄)²
= β1 Σ(x_i − x̄)(x_i − x̄ + x̄) / Σ(x_i − x̄)²
= β1 Σ(x_i − x̄)² / Σ(x_i − x̄)²
= β1
Bias[β̂1] = β1 − β1 = 0
Bias[β̂0] = E[β̂0] − β0
E[β̂0] = E[Ȳ − β̂1 x̄]
= (1/n) Σ E[Y_i] − E[β̂1] x̄
= (1/n) Σ (β0 + β1 x_i) − β1 x̄
= β0 + β1 x̄ − β1 x̄
= β0
Bias[β̂0] = β0 − β0 = 0
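Unbiasedness can also be illustrated by simulation; a minimal sketch with illustrative true values β0 = 1, β1 = 2, and σ = 0.5:
# average the LS estimates over many simulated data sets
set.seed(1)
x <- runif(50)
estimates <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(50, sd = 0.5)
  coef(lm(y ~ x))
})
rowMeans(estimates) # should be close to the true values c(1, 2)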
Appendix
library(knitr)
library(MASS)
# set global chunk options: images will be 7×5 inches
knitr::opts_chunk$set(fig.width=7, fig.height=5)
options(digits = 4)
# the data set "trees" is contained in the R package "datasets"
require(datasets)
head(trees)
# The dimensions of the tree data set
(dim(trees))
# Scatter plot matrix of the three variables
pairs(trees)
# Correlation matrix of the three variables
cor(trees)
# Number of NA values
sum(is.na(trees))
# Creating variables representing y, x1, and x2
y <- trees$Volume
x1 <- trees$Girth
x2 <- trees$Height
# Fitting a linear model to the tree data using the given formula
lm_tree <- lm(log(y) ~ log(x1) + log(x2))
summary(lm_tree)
# Design matrix of the fitted model
model_matrix <- model.matrix(lm_tree)
model_matrix
# Least squares estimates from the closed-form formula
beta_hat <- ginv(t(model_matrix) %*% model_matrix) %*% t(model_matrix) %*% log(y)
beta_hat
# Coefficient estimates taken from the summary output
beta0_hat <- -6.631617
beta1_hat <- 1.982650
beta2_hat <- 1.117123
# Predicted responses and residuals
y_hat <- beta0_hat + beta1_hat * log(x1) + beta2_hat * log(x2)
residual <- log(y) - y_hat
# Estimate of the error variance (n - p = 31 - 3 = 28 degrees of freedom)
estimate_of_error <- sum(residual^2) / 28