Let us start by finding our working directoty.
getwd()
[1] "/cloud/project"
# make sure the packages for this chapter
# are installed, install if necessary
pkg <- c("ggplot2", "scales", "maptools",
"sp", "maps", "grid", "car" )
new.pkg <- pkg[!(pkg %in% installed.packages())]
if (length(new.pkg)) {
install.packages(new.pkg)
}
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.4’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘maptools’ is not available for this version of R
A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Let us now make sure that the file zeroaccess is uploaded into your new project. Then let us define the variable za.
# read the CSV with headers
za <- read.csv("zeroaccess.csv", header=T,sep ="," )
What is the output we have just obtained after running the chunk above. The output is that the dataset zeroaccess.csv has been successfully loaded into R as a data frame named za. We can confirm this by running:
View(za)
I can clearly see the dataset in another window in R.
# Load ggplot2 to create graphics
library(ggplot2)
# create a ggplot instance with zeroaccess data
gg <- ggplot(data=za, aes(x=long, y=lat))
# add the points, set transparency to 1/40th
gg <- gg + geom_point(size=1, color="#000099", alpha=1/40)
# add axes labels
gg <- gg + xlab("Longitude") + ylab("Latitude")
# simplify the theme for aesthetics
gg <- gg + theme_bw()
# this may take a while, over 800,000 points plotted
print(gg)
install.packages("mapproj")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.4’
(as ‘lib’ is unspecified)
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/mapproj_1.2.11.tar.gz'
Content type 'application/x-gzip' length 50315 bytes (49 KB)
==================================================
downloaded 49 KB
* installing *binary* package ‘mapproj’ ...
* DONE (mapproj)
The downloaded source packages are in
‘/tmp/RtmptlBV5V/downloaded_packages’
library(mapproj)
Loading required package: maps
# requires package : ggplot2
# requires object: za (5-1)
# the "maps" and "mapproj" packages are used by ggplot
# load map data of the world
world <- map_data("world")
#Remove Antarctica
world <- subset(world, world$region!="Antarctica")
# load world data into ggplot object
gg <- ggplot(data=world, aes(x=long, y=lat))
# trace along the lat/long coords by group (countries)
gg <- gg + geom_path(aes(group=group), colour="gray70")
# now project using the mercator projection
# try different projections with ?mapproject
gg <- gg + coord_map("mercator", xlim=c(-200, 200))
# load up the ZeroAccess points, overiding the default data set
gg <- gg + geom_point(data=za, aes(long, lat),
colour="#000099", alpha=1/40, size=1)
# remove text, axes ticks, grid lines and do gray border on white
gg <- gg + theme(text=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
panel.background=element_rect(color="gray50",
fill="white"))
print(gg)
Interpret the output above.
The result is a world map displaying blue points that represent geospatial data from the ZA dataset. Rendered using the Mercator projection, the map highlights a dense concentration of points in North America, Europe, and parts of Asia, while fewer are observed in Africa and South America. Overall, the distribution of points is irregular. Let us proceed to define the file county.dataset found on Canvas.
# read up census data per county
county.data <- read.csv("countydataset.csv", header=T,sep = ",")
View(county.data)
set.seed(1)
# generate 200 random numbers around 10
input <- rnorm(200, mean=10)
summary(input)
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.785 9.386 9.951 10.036 10.613 12.402
Interpret the synatx and the output above. The code here begins by setting a seed to ensure reproducibility, guaranteeing that the same random numbers are generated with each run. It then creates 200 random values from a normal distribution with a mean of 10 and a standard deviation of 1. The summary(input) function provides key descriptive statistics, revealing a minimum value of 7.785 and a maximum of 12.402, defining the data range. The first quartile (9.386) and third quartile (10.613) show that the central 50% of values cluster around the mean. The median (9.951) is very close to the mean (10.036), indicating that the distribution is nearly symmetrical around 10.
install.packages("ggplot2")
Error in install.packages : Updating loaded packages
library(ggplot2)
# requires objects: input (5-16)
# generate output around a mean of 2 x input
output <- rnorm(200, mean=input*2)
# put into data frame to plot it
our.data <- data.frame(input, output)
gg <- ggplot(our.data, aes(input, output))
gg <- gg + geom_point()
gg <- gg + geom_smooth(method = "lm", se=F, color="red")
gg <- gg + theme_bw()
print(gg)
`geom_smooth()` using formula = 'y ~ x'
Make comments about both the syntax and the output of the task executed above. Is it possible to customize this graph to make it more explanatory? If so explain and run the code. The syntax follows the ggplot2 package in R, where geom_smooth() applies a linear regression model (y ~ x) to visualize the trend between input and output. The black points indicate individual data values, while the red line represents the fitted regression model. To enhance clarity and make the graph more informative, the points can be adjusted for better visibility by modifying their size and transparency using the size and alpha parameters in geom_point(). Additionally, the regression line can be thickened for emphasis, and a confidence interval (se = TRUE) can be included to provide a better understanding of variability. Furthermore, axis labels, a title, and customized colors can improve readability and presentation.
# Generate output around a mean of 2 x input
output <- rnorm(200, mean = input * 2)
# Put into a data frame
our.data <- data.frame(input, output)
# Create plot with enhancements
gg <- ggplot(our.data, aes(input, output)) +
geom_point(alpha = 0.6, size = 2, color = "blue") + # Add transparency and color
geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) + # Add confidence interval
labs(title = "Scatter Plot of Input vs. Output",
x = "Input Values",
y = "Output Values") + # Add labels
theme_minimal() # Use a cleaner theme
print(gg)
`geom_smooth()` using formula = 'y ~ x'
Let us now run output vs input.
model <- lm(output ~ input)
summary(model)
Call:
lm(formula = output ~ input)
Residuals:
Min 1Q Median 3Q Max
-2.9079 -0.6460 -0.0414 0.6721 3.7303
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.82971 0.82296 -1.008 0.315
input 2.07851 0.08166 25.454 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.07 on 198 degrees of freedom
Multiple R-squared: 0.7659, Adjusted R-squared: 0.7648
F-statistic: 647.9 on 1 and 198 DF, p-value: < 2.2e-16
Is the input significant at a 5% significance level?1%? The p-value for the input variable is < 2e-16, which is significantly smaller than both 0.05 (5%) and 0.01 (1%). Since the p-value is below these significance levels, we REJECT the null hypothesis that the coefficient of input is zero. This indicates that input is highly significant at both the 5% and 1% levels, strongly suggesting a meaningful relationship between input and output. Additional Insights: The coefficient estimate for input is 2.07851, meaning that for every unit increase in input, output increases by approximately 2.08 on average. The R-squared value is 0.7659, indicating that 76.59% of the variance in output is explained by input, which shows a strong linear relationship. The F-statistic of 647.9 with a p-value < 2.2e-16 confirms that the overall regression model is highly significant. Thus, input is not only statistically significant but also explains a substantial portion of the variation in output, reinforcing the strength of the model. Let us now proceed to get the confidence interval for this model.
confint(model)
2.5 % 97.5 %
(Intercept) -2.452598 0.7931737
input 1.917481 2.2395382
Using the county.data dataset, let us run a model of Infections vs ufo2010(Aliens Visits according to the UFO).
summary(lm(county.data$Infections ~ county.data$ufo2010, data= county.data))
Call:
lm(formula = county.data$Infections ~ county.data$ufo2010, data = county.data)
Residuals:
Min 1Q Median 3Q Max
-407.50 -70.35 -5.78 60.05 2736.39
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.77740 2.38569 52.302 < 2e-16 ***
county.data$ufo2010 0.56899 0.07998 7.114 1.4e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 127.5 on 3070 degrees of freedom
Multiple R-squared: 0.01622, Adjusted R-squared: 0.0159
F-statistic: 50.61 on 1 and 3070 DF, p-value: 1.398e-12
Given the output printed above, is ufo2010 significant at a 5% significance level?1%? The p-value is 1.4e-12, which is much smaller than both 0.05 (5%) and 0.01 (1%), leading us to REJECT the null hypothesis that the coefficient of ufo2010 is zero. This suggests that ufo2010 is highly significant at both the 5% and 1% levels, indicating a statistically significant relationship between UFO sightings in 2010 and infections. The coefficient estimate for county.data$ufo2010 is 0.56899, meaning that each additional UFO sighting in 2010 is associated with an average increase of approximately 0.569 infections. However, the R-squared value is 0.01622, indicating that only 1.6% of the variation in infections is explained by UFO sightings, suggesting a weak relationship. Despite the statistical significance, the low R-squared value points to the likelihood that other factors are more influential in explaining the variation in infections. Let us now run a model of Infections vs every single quantitative variable that is included in the dataset.
#View(county.data)
summary(lm(Infections ~ pop + income + ipaddr + ufo2010,
data=county.data))
Call:
lm(formula = Infections ~ pop + income + ipaddr + ufo2010, data = county.data)
Residuals:
Min 1Q Median 3Q Max
-342.94 -70.34 -5.40 58.85 2745.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.000e+02 9.426e+00 10.610 < 2e-16 ***
pop -1.615e-05 1.543e-05 -1.047 0.29510
income 5.670e-04 2.066e-04 2.745 0.00609 **
ipaddr -9.865e-08 5.148e-07 -0.192 0.84803
ufo2010 6.804e-01 1.691e-01 4.024 5.85e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 127.3 on 3067 degrees of freedom
Multiple R-squared: 0.01879, Adjusted R-squared: 0.01751
F-statistic: 14.69 on 4 and 3067 DF, p-value: 6.915e-12
Interpret the output. How would you proceed from now on in this handout given the results obtained above. The linear regression output reveals that certain predictors, like income and ufo2010, are statistically significant in explaining infections, while others, such as pop and ipaddr, are not. The p-value for income is 0.00609, indicating that higher income is significantly associated with an increase in infections. The p-value for ufo2010 is 5.85e-05, suggesting a significant positive relationship between UFO sightings and infections, although this relationship is likely weak. On the other hand, population and ipaddr have higher p-values (0.29510 and 0.84803, respectively), indicating that these variables are not statistically significant predictors of infections. The R-squared value is 0.01879, showing that only 1.88% of the variation in infections is explained by the model, suggesting a weak overall fit. Given these results, while some variables like income and ufo2010 are significant, the model overall has a weak explanatory power. In the next steps, it would be useful to explore adding more relevant predictors or transforming existing ones, or considering other modeling techniques to improve the model’s fit and explanatory power.
install.packages("carData")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.4’
(as ‘lib’ is unspecified)
trying URL 'http://rspm/default/__linux__/focal/latest/src/contrib/carData_3.0-5.tar.gz'
Content type 'application/x-gzip' length 1821260 bytes (1.7 MB)
==================================================
downloaded 1.7 MB
* installing *binary* package ‘carData’ ...
* DONE (carData)
The downloaded source packages are in
‘/tmp/RtmptlBV5V/downloaded_packages’
# Load the 'car' package (ensure it's installed first: install.packages("car"))
library(car)
Loading required package: carData
# Use the vif() function to check variance inflation factor for multicollinearity
# Replace 'model' with your actual regression model (e.g., lm_model)
# Fit a multiple regression model
full_model <- lm(Infections ~ pop + income + ipaddr + ufo2010, data=county.data)
# Check for multicollinearity
vif(full_model)
pop income ipaddr ufo2010
4.689208 1.078413 1.094222 4.475392
Let us just explore the variance inflation factor(VIF) of the model to see if there is a chance of high correlation between my predictors. I remind you that a strong correlation between two of my predictors will likely end up in heteroskedasticity, and therefore our model would not be accuarate.
model <- lm(Infections ~ pop + income + ipaddr + ufo2010,
data=county.data)
sqrt(vif(model))
pop income ipaddr ufo2010
2.165458 1.038467 1.046051 2.115512
All the VIF values in the model are relatively low, with none exceeding 5. Generally, a VIF value above 10 is considered a sign of problematic multicollinearity among predictors. Since all the values here fall well below that threshold, there is no strong evidence of multicollinearity. This suggests that the predictors in the model are not highly correlated with each other, meaning multicollinearity is unlikely to be a major concern. Let us see if population affects the number of infections in this dataset. Write the null and alternative hypothesis you would use to test this relationship.
summary(lm(Infections ~ pop, data=county.data))
Call:
lm(formula = Infections ~ pop, data = county.data)
Residuals:
Min 1Q Median 3Q Max
-365.26 -70.35 -5.70 59.59 2724.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.250e+02 2.416e+00 51.755 < 2e-16 ***
pop 4.228e-05 7.147e-06 5.916 3.67e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 127.8 on 3070 degrees of freedom
Multiple R-squared: 0.01127, Adjusted R-squared: 0.01095
F-statistic: 35 on 1 and 3070 DF, p-value: 3.668e-09
Interpret the results obtained above. The p-value for population is 3.67e-09, which is far below the 0.05 significance level. This leads us to REJECT the null hypothesis and conclude that population has a statistically significant effect on infections. However, the R-squared value is only 0.01127, indicating that population explains just 1.13% of the variation in infections. While the relationship is statistically significant, the low R-squared suggests that population alone has a minimal impact on predicting infections, and other factors likely play a much larger role. Now let us define the regression of Infections vs pop as pop.lm and predict the number of infections based on the variable population.
pop.lm <- lm(Infections ~ pop, data=county.data)
predict(pop.lm, data.frame(pop=6000000), interval="confidence")
fit lwr upr
1 378.7109 295.9209 461.5009
Interpret the results obtained above. The predicted number of infections for a population of 6 million is 378.71. The 95% confidence interval for this prediction ranges from 295.92 to 461.50, meaning we can be 95% confident that the actual number of infections for a population of 6 million falls within this interval. This interval provides a measure of uncertainty around the prediction, indicating that while 378.71 is the best estimate, the true value could reasonably be anywhere within the given range. Predict the number of infections based on the following values for the variable population: 1000000,2000000.
predict(pop.lm, data.frame(pop=c(1000000, 2000000)), interval="confidence")
fit lwr upr
1 167.3069 153.9224 180.6915
2 209.5877 182.5947 236.5807
These predictions show how the number of infections increases as the population increases.
Recreate this solution building a model that depends on income. Predict the number of infections based on the following values for the variable income: 10000,20000,40000,90000.
income.lm <- lm(Infections ~ income, data=county.data)
predict(income.lm, data.frame(income=c(10000, 20000, 40000, 90000)), interval="confidence")
fit lwr upr
1 100.2013 85.70806 114.6945
2 108.4966 97.66147 119.3318
3 125.0872 120.13584 130.0387
4 166.5638 148.35823 184.7694
This means that, for each given income level, we are 95% confident that the true number of infections falls within the corresponding interval. The predicted values give the best estimates, while the confidence intervals reflect the uncertainty in the predictions.